Predicting Spotify Song Popularity

Author: Polina Adamovich

Data source: Tidy Tuesday project on Github

Table of contents:

  • A. EDA

  • B. Clustering

  • C. Models: Fitting & Interpretation

  • D. Models: Predictions

  • E. Models: Performance & Validation

A. EDA

Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

Read in the data

In [2]:
songs_url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'

df = pd.read_csv( songs_url )

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudness                  32833 non-null  float64
 15  mode                      32833 non-null  int64  
 16  speechiness               32833 non-null  float64
 17  acousticness              32833 non-null  float64
 18  instrumentalness          32833 non-null  float64
 19  liveness                  32833 non-null  float64
 20  valence                   32833 non-null  float64
 21  tempo                     32833 non-null  float64
 22  duration_ms               32833 non-null  int64  
dtypes: float64(9), int64(4), object(10)
memory usage: 5.8+ MB

a. Basic information

Show the number of rows and columns
In [3]:
df.shape
Out[3]:
(32833, 23)

The df DataFrame has 32,833 rows and 23 columns.

Display the variable names and their associated data types
In [4]:
df.dtypes
Out[4]:
track_id                     object
track_name                   object
track_artist                 object
track_popularity              int64
track_album_id               object
track_album_name             object
track_album_release_date     object
playlist_name                object
playlist_id                  object
playlist_genre               object
playlist_subgenre            object
danceability                float64
energy                      float64
key                           int64
loudness                    float64
mode                          int64
speechiness                 float64
acousticness                float64
instrumentalness            float64
liveness                    float64
valence                     float64
tempo                       float64
duration_ms                   int64
dtype: object
Display the number of missing values for each variable
In [5]:
df.isna().sum()
Out[5]:
track_id                    0
track_name                  5
track_artist                5
track_popularity            0
track_album_id              0
track_album_name            5
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64

There are three variables that have missing values: track_name, track_artist and track_album_name. Each of these variables is missing 5 values.
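Before deciding how to handle the missing values, the affected rows can be inspected directly. A minimal sketch on a toy frame (the column names mirror the real data; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern above: one row with a missing track_name.
toy = pd.DataFrame({
    'track_id': ['a', 'b', 'c'],
    'track_name': ['Song A', np.nan, 'Song C'],
    'track_popularity': [60, 10, 45],
})

# Rows with at least one missing value -- worth eyeballing before deciding
# whether to drop them (popularity is still present, so they could be kept).
missing_rows = toy[toy.isna().any(axis=1)]
print(missing_rows)
```

Since only the text columns are missing and the audio features are complete, these rows would not necessarily need to be dropped for modeling.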

Display the number of unique values for each variable
In [6]:
df.nunique()
Out[6]:
track_id                    28356
track_name                  23449
track_artist                10692
track_popularity              101
track_album_id              22545
track_album_name            19743
track_album_release_date     4530
playlist_name                 449
playlist_id                   471
playlist_genre                  6
playlist_subgenre              24
danceability                  822
energy                        952
key                            12
loudness                    10222
mode                            2
speechiness                  1270
acousticness                 3731
instrumentalness             4729
liveness                     1624
valence                      1362
tempo                       17684
duration_ms                 19785
dtype: int64

Cleaning the dataset

According to the Tidy Tuesday GitHub page, track_id is a unique song ID. Yet there are fewer unique track_id values than rows in the dataset.

In [7]:
df.track_id.value_counts().value_counts()
Out[7]:
count
1     25190
2      2384
3       510
4       142
5        60
6        35
7        17
8        15
9         2
10        1
Name: count, dtype: int64

We can see that some track ids have from 2 to 10 rows associated with them. Further exploration (not shown in this notebook for brevity) reveals that each row represents a track from an album within a specific playlist - meaning duplicates arise from the same track appearing in multiple playlists.
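The kind of check described above can be sketched as follows, on a toy frame rather than the real data: pull out every row for the duplicated ids and confirm that only the playlist columns differ.

```python
import pandas as pd

# Toy frame: the same track_id appears in two different playlists.
toy = pd.DataFrame({
    'track_id':         ['t1', 't1', 't2'],
    'playlist_name':    ['Pop Hits', 'Workout', 'Indie Mix'],
    'playlist_genre':   ['pop', 'edm', 'rock'],
    'track_popularity': [70, 70, 30],
})

# Find ids appearing more than once, then inspect all of their rows:
# the track-level columns repeat, while the playlist columns vary.
dup_ids = toy.track_id.value_counts()
dup_ids = dup_ids[dup_ids > 1].index
print(toy[toy.track_id.isin(dup_ids)])
```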

Now I will check whether the duplication of the track_id is associated with changes in the output, track_popularity, and inputs of interest. We group the data by track_id and count the number of unique values for each feature of interest - including both the target track_popularity and the input features we might use in our model.

In [8]:
df.groupby(['track_id']).\
aggregate(num_track_pop_values = ('track_popularity', 'nunique'),
          num_playlist_genre_values = ('playlist_genre', 'nunique'),
          num_playlist_subgenre_values = ('playlist_subgenre', 'nunique'),
          num_danceability_values = ('danceability', 'nunique'),
          num_energy_values = ('energy', 'nunique'),
          num_key_values = ('key', 'nunique'),
          num_loudness_values = ('loudness', 'nunique'),
          num_mode_values = ('mode', 'nunique'),
          num_speechiness_values = ('speechiness', 'nunique'),
          num_acousticness_values = ('acousticness', 'nunique'),
          num_instrumentalness_values = ('instrumentalness', 'nunique'),
          num_liveness_values = ('liveness', 'nunique'),
          num_valence_values = ('valence', 'nunique'),
          num_tempo_values = ('tempo', 'nunique'),
          num_duration_values = ('duration_ms', 'nunique')).\
reset_index().\
nunique()
Out[8]:
track_id                        28356
num_track_pop_values                1
num_playlist_genre_values           5
num_playlist_subgenre_values       10
num_danceability_values             1
num_energy_values                   1
num_key_values                      1
num_loudness_values                 1
num_mode_values                     1
num_speechiness_values              1
num_acousticness_values             1
num_instrumentalness_values         1
num_liveness_values                 1
num_valence_values                  1
num_tempo_values                    1
num_duration_values                 1
dtype: int64

We observe that some tracks appear with up to 5 different playlist_genre values and up to 10 different playlist_subgenre values. This is expected, since the same track can be included in multiple playlists with different themes or categorizations.

To ensure each row in the dataset represents a unique track, I filtered the original dataset to tracks that appear exactly once, and saved the result as a new dataset, df_clean.

In [9]:
track_counts = df.track_id.value_counts()
unique_track_ids = track_counts[track_counts == 1].index
df_clean = df[df.track_id.isin(unique_track_ids)]

df_clean.shape
Out[9]:
(25190, 23)
In [10]:
df.shape
Out[10]:
(32833, 23)
In [11]:
(32833 - 25190) / 32833
Out[11]:
0.23278408917857035

I chose to remove all duplicate tracks, which reduced the dataset size by about 23%. The main benefit was making the data much simpler to work with, as I didn't have to figure out how to handle cases where the same song appeared in conflicting playlists (for example, a song listed in both a pop and a rock playlist). The primary cost, however, is the reduction in data size, which could slightly decrease the model's statistical power. I determined that the value of a simpler, more interpretable dataset was worth the trade-off of a smaller sample size.
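For comparison, an alternative (not the approach taken here) would be to keep one arbitrary row per track_id, which preserves sample size at the cost of an arbitrary playlist label. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame: track t1 appears in two playlists with conflicting genre labels.
toy = pd.DataFrame({
    'track_id':         ['t1', 't1', 't2'],
    'playlist_genre':   ['pop', 'edm', 'rock'],
    'track_popularity': [70, 70, 30],
})

# Keep the first row per track_id instead of discarding duplicated tracks.
deduped = toy.drop_duplicates(subset='track_id', keep='first')
print(deduped.shape)
```

This would have retained all 28,356 tracks, but the retained genre label would depend on row order, which is why I preferred the simpler filtering approach.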

I'm treating the variable mode as a non-numeric column because it has a low number of unique values (2) and because those numbers represent categories, not quantities. I will also treat the column key as non-numeric for the same reasons.

In [12]:
df_copy=df_clean.copy()

df_copy['key'] = df_copy['key'].astype('object')
df_copy['mode'] = df_copy['mode'].astype('object')

Here I transformed the original target variable track_popularity into an object-type variable binary_outcome with 2 unique values: 1 if the song's popularity is over 50, and 0 otherwise.

In [13]:
df_copy['binary_outcome'] = np.where( df_copy.track_popularity > 50, 1, 0 )
df_copy['binary_outcome'] = df_copy['binary_outcome'].astype('object')

b. Visualization

1. Counts of categorical variables
In [14]:
sns.catplot(data=df_copy, x='playlist_genre', kind='count', hue='playlist_genre', aspect=1)

plt.show()
[Figure: count plot of playlist_genre]

The bar chart above shows the distribution of songs across six different playlist genres in the cleaned dataset. While the genres are relatively balanced, rap stands out as the most represented genre.

In [15]:
sns.catplot(data=df_copy, x='mode', kind='count', hue='mode')

plt.show()
[Figure: count plot of mode]

While the distribution of songs across the two modes (1 - major, 0 - minor) is also relatively balanced, mode 1 is slightly more prevalent.

In [16]:
sns.catplot(data=df_copy, x='key', kind='count', hue='key')

plt.show()
[Figure: count plot of key]

The distribution of songs across the 12 musical keys is relatively balanced, with the exception of key 3, which has the fewest songs. Keys 0, 1, and 7 are the most common in the cleaned dataset.

In [17]:
sns.catplot(data=df_copy, x='binary_outcome', kind='count', hue='binary_outcome')

plt.show()
[Figure: count plot of binary_outcome]

This bar chart shows that our cleaned dataset contains roughly twice as many unpopular songs (track_popularity of 50 or below) as popular ones.

In [18]:
df_copy.binary_outcome.value_counts(normalize=True)
Out[18]:
binary_outcome
0    0.672092
1    0.327908
Name: proportion, dtype: float64

More than 67% of the songs in the dataset are classified as unpopular (using a popularity threshold of 50)!
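This imbalance sets a useful baseline for the later modeling sections: a classifier that always predicts the majority class is already about 67% accurate, so any model must beat that. A quick illustration using the observed class balance:

```python
import pandas as pd

# Toy outcome series with roughly the observed 67/33 class balance.
toy_outcome = pd.Series([0] * 67 + [1] * 33)

# Accuracy of always predicting the majority class: the largest class proportion.
majority_baseline = toy_outcome.value_counts(normalize=True).max()
print(majority_baseline)  # 0.67
```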

2. Distributions of continuous variables

To visualize marginal distributions for all continuous variables at once, I will first transform the DataFrame into long format, which Seaborn prefers, and call it df_lf.

In [19]:
df_features = df_copy.select_dtypes('number').copy()
In [20]:
df_objects = df_copy.select_dtypes('object').copy()
In [21]:
id_cols = ['rowid'] + df_objects.columns.to_list()
In [22]:
df_lf = df_copy.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)

I will now create density plots for all continuous variables.

In [23]:
sns.displot(data=df_lf, x='value', col='variable', kind='kde', col_wrap = 3,
            facet_kws={'sharex':False, 'sharey':False},
            common_norm=False)

plt.show()
[Figure: density plots of all continuous variables]

None of the distributions appear perfectly symmetric or unimodal - several exhibit multiple peaks, while others have long tails, indicating skewness. This suggests that the continuous variables deviate from a Gaussian distribution. Among them, the distributions of danceability, valence, energy, and duration are the most Gaussian-like.

3. Relationships between continuous variables

The pairs plot below shows the relationships between all continuous variables.

In [24]:
sns.pairplot(data=df_copy, x_vars=['track_popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
             y_vars=['track_popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
             diag_kind='kde',
             diag_kws={'common_norm': False})

plt.show()
[Figure: pairs plot of all continuous variables]

There is a positive correlation between loudness and energy. Aside from this, the other scatterplots do not show any obvious linear or non-linear relationships, with the points generally appearing as unstructured clouds.

In [25]:
fig, ax = plt.subplots()

sns.heatmap(data = df_copy.loc[:, ['track_popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']].corr(),
            vmin=-1, vmax=1, center = 0,
            cmap='coolwarm',
            annot=True, annot_kws={'size': 7},
            ax=ax)

plt.show()
[Figure: correlation heatmap of continuous variables]

The correlation plot above helps to see the relationships between continuous variables more clearly. It shows a somewhat strong positive correlation between loudness and energy and moderate negative correlation between acousticness and energy. Most other variable pairs show weak correlations, both positive and negative. The target variable, track_popularity, is only weakly correlated with other features - showing a slight positive correlation with acousticness and slight negative correlations with energy, instrumentalness, and duration_ms.
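Rather than reading the heatmap cell by cell, the strongest pairs can also be ranked programmatically. A sketch on a small synthetic frame (in the notebook this would be the continuous columns of df_copy):

```python
import numpy as np
import pandas as pd

# Toy numeric frame: loudness is built to correlate with energy, tempo is noise.
rng = np.random.default_rng(0)
energy = rng.normal(size=200)
toy = pd.DataFrame({
    'energy': energy,
    'loudness': energy * 0.7 + rng.normal(scale=0.5, size=200),
    'tempo': rng.normal(size=200),
})

# Keep the upper triangle of the absolute correlation matrix (each pair once),
# then sort to list the strongest relationships first.
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
print(pairs.sort_values(ascending=False).head())
```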

4. Summaries of the continuous variables grouped by categorical variables

First, I will study distributions of all continuous variables grouped by playlist_genre.

In [26]:
sns.displot(data=df_lf, x='value', col='variable', col_wrap=3, kind='kde',
            hue='playlist_genre',
            facet_kws={'sharex': False, 'sharey': False},
            common_norm=False,
            palette='bright')

plt.show()
[Figure: density plots of continuous variables grouped by playlist_genre]

The distributions of continuous variables such as danceability, energy, speechiness, acousticness, instrumentalness, valence, tempo, and duration_ms vary substantially across different playlist genres.

The plot below shows distributions of all continuous variables grouped by key.

In [27]:
sns.displot(data=df_lf, x='value', col='variable', col_wrap=3, kind='kde',
            hue='key',
            facet_kws={'sharex': False, 'sharey': False},
            common_norm=False,
            palette='bright')

plt.show()
[Figure: density plots of continuous variables grouped by key]

While there are some subtle shifts, the distributions of the continuous variables are largely consistent across the different musical keys. This suggests that musical key is not a primary factor influencing these audio features.

The plot below shows distributions of all continuous variables grouped by mode.

In [28]:
sns.displot(data=df_lf, x='value', col='variable', col_wrap=3, kind='kde',
            hue='mode',
            facet_kws={'sharex': False, 'sharey': False},
            common_norm=False)

plt.show()
[Figure: density plots of continuous variables grouped by mode]

The distributions of continuous variables appear largely similar across the two categories of mode. This suggests that mode may not have a strong influence on the distributions of these continuous features.

I will now create a series of point plots to explore how the means of all continuous variables vary across 3 categorical variables (playlist_genre, mode, and key). Each facet represents a different continuous variable, the x-axis shows the categories of a selected categorical variable, and y-axis shows values of the continuous variable within each facet.

In [29]:
sns.catplot(data = df_lf, x='playlist_genre', y='value', col='variable', col_wrap=3, kind='point', hue='playlist_genre',
            sharey=False, linestyle='none')

plt.show()
[Figure: point plots of continuous-variable means by playlist_genre]

The point plots above suggest meaningful differences in the mean values of continuous variables across some playlist genres. In particular, the facet showing the target variable track_popularity reveals that songs in pop, rap, rock, and Latin playlists tend to have higher average popularity, with overlapping confidence intervals. In contrast, songs in edm playlists have a significantly lower mean popularity, with a non-overlapping confidence interval - indicating a likely true difference in average popularity compared to the other genres.
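The intervals behind such point plots are straightforward to reproduce numerically. A sketch on synthetic popularity scores for two genres (the real computation would group df_copy by playlist_genre):

```python
import numpy as np
import pandas as pd

# Toy data: 50 popularity scores per genre, with different true means.
toy = pd.DataFrame({
    'playlist_genre': ['pop'] * 50 + ['edm'] * 50,
    'track_popularity': np.concatenate([
        np.random.default_rng(1).normal(55, 10, 50),
        np.random.default_rng(2).normal(35, 10, 50),
    ]),
})

# Group mean, standard error, and an approximate 95% CI half-width --
# the same quantities the point plot displays graphically.
summary = toy.groupby('playlist_genre')['track_popularity'].agg(['mean', 'sem'])
summary['ci95'] = 1.96 * summary['sem']
print(summary)
```

Two group means whose intervals (mean ± ci95) do not overlap correspond to the non-overlapping error bars read off the plot above.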

In [30]:
sns.catplot(data = df_lf, x='key', y='value', col='variable', col_wrap=3, kind='point', hue='key',
            sharey=False, linestyle='none')

plt.show()
[Figure: point plots of continuous-variable means by key]

The point plots above suggest that there are meaningful differences in the means of some continuous variables grouped by key. Focusing on the facet showing the target variable track_popularity, there appear to be meaningful differences in average popularity across keys; for example, songs in key 8 have higher average popularity than songs in several other keys (e.g., keys 2, 6, and 7).

In [31]:
sns.catplot(data = df_lf, x='mode', y='value', col='variable', col_wrap=3, kind='point', hue='mode',
            sharey=False, linestyle='none')

plt.show()
[Figure: point plots of continuous-variable means by mode]

The point plot above shows meaningful differences in the average values of track popularity, danceability, loudness, speechiness, and tempo depending on the mode of the song. Focusing on the facet for track_popularity, songs in mode 1 have higher average popularity than songs in mode 0.

In [32]:
sns.catplot(data = df_lf, x='binary_outcome', y='value', col='variable', col_wrap=3, kind='point', hue='binary_outcome',
            sharey=False, linestyle='none')

plt.show()
[Figure: point plots of continuous-variable means by binary_outcome]

The point plot reveals notable differences (non-overlapping confidence intervals) in the mean values of several audio features between popular and unpopular songs. Specifically, popular songs tend to have higher danceability, loudness, and acousticness, while showing lower average valence, energy, instrumentalness, liveness, and duration_ms.

5. Histograms and relationships between continuous inputs broken up by the outcome's unique values

The plot below shows the distributions of all continuous variables, grouped by binary_outcome, along the diagonal, as well as relationships between input continuous variables in the off-diagonal plots, also grouped by binary_outcome.

In [33]:
sns.pairplot(data=df_copy, x_vars=['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
             y_vars=['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
             hue='binary_outcome',
             diag_kind='kde',
             diag_kws={'common_norm': False})

plt.show()
[Figure: pairs plot of continuous inputs colored by binary_outcome]

Grouping the relationships between continuous variables by binary_outcome did not reveal any clear patterns or separation between the two classes. Across all scatterplots, the data points for both outcomes appear largely overlapping, suggesting little visual distinction between them.

The histograms below show the distribution of each continuous input variable separately for the two categories of the binary_outcome.

In [34]:
sns.displot(data=df_copy, x='danceability', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of danceability by binary_outcome]
In [35]:
sns.displot(data=df_copy, x='energy', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of energy by binary_outcome]
In [36]:
sns.displot(data=df_copy, x='loudness', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of loudness by binary_outcome]
In [37]:
sns.displot(data=df_copy, x='speechiness', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of speechiness by binary_outcome]
In [38]:
sns.displot(data=df_copy, x='acousticness', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of acousticness by binary_outcome]
In [39]:
sns.displot(data=df_copy, x='instrumentalness', col='binary_outcome', kind='hist', bins=50)

plt.show()
[Figure: histograms of instrumentalness by binary_outcome]
In [40]:
sns.displot(data=df_copy, x='liveness', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of liveness by binary_outcome]
In [41]:
sns.displot(data=df_copy, x='valence', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of valence by binary_outcome]
In [42]:
sns.displot(data=df_copy, x='tempo', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of tempo by binary_outcome]
In [43]:
sns.displot(data=df_copy, x='duration_ms', col='binary_outcome', kind='hist')

plt.show()
[Figure: histograms of duration_ms by binary_outcome]

The shapes of the distributions appear similar across the two outcome groups, suggesting that none of these continuous variables are strongly predictive of the outcome on their own. The apparent difference in the height of the histograms is due to the fact that category 0 has approximately twice as many observations as category 1. Since the default histograms display raw counts, the plot for category 0 naturally appears taller, even though the underlying distribution shapes are comparable.

6. Counts of observations for each combination of the outcome's unique values and the categorical inputs' unique values

The dodged bar chart below shows how the 6 playlist genres are distributed across the 2 outcomes. Most popular songs in the dataset are in the rap category, while edm has the largest number of unpopular songs and the smallest number of popular songs.

In [44]:
sns.catplot(data=df_copy, x='playlist_genre', hue='binary_outcome', kind='count', aspect=1)

plt.show()
[Figure: dodged count plot of playlist_genre by binary_outcome]

The dodged bar chart below shows how the 12 musical keys are distributed across the 2 outcomes. Key 1 contains both the highest number of popular songs and the highest number of unpopular songs.

In [45]:
sns.catplot(data=df_copy, x='key', hue='binary_outcome', kind='count', aspect=1)

plt.show()
[Figure: dodged count plot of key by binary_outcome]

The dodged bar chart below shows how the 2 modes are distributed across the 2 outcomes. Mode 1 contains both the highest number of popular songs and the highest number of unpopular songs.

In [46]:
sns.catplot(data=df_copy, x='mode', hue='binary_outcome', kind='count', aspect=1)

plt.show()
[Figure: dodged count plot of mode by binary_outcome]

In summary, the EDA revealed that while playlist genres, modes, and keys are relatively balanced in the cleaned dataset, rap is the most represented genre, mode 1 is slightly more common, and keys 0, 1, and 7 dominate, with key 3 being least frequent. The dataset contains about twice as many unpopular as popular songs. Continuous variables generally deviate from a Gaussian distribution, though danceability, valence, energy, and duration are closest to normality; loudness and energy show a strong positive correlation, while acousticness and energy are moderately negatively correlated. Differences in audio features emerge across genres, keys, and modes - pop, rap, rock, and Latin genres tend to have higher popularity, while edm has the lowest. Songs in certain keys (e.g., key 8) and mode 1 show higher average popularity. Popular songs tend to have higher danceability, loudness, and acousticness but lower valence, energy, instrumentalness, liveness, and duration. However, scatterplots grouped by outcome show substantial overlap between classes, suggesting that no single continuous variable is strongly predictive of popularity on its own. Finally, no obvious linear or nonlinear relationship was observed between the continuous input variables and track_popularity (see Supporting - EDA for more).

B. Clustering

This time, I selected danceability, valence, and speechiness as input variables for clustering. This combination captures diverse aspects of the tracks, making it a strong candidate for identifying meaningful groupings.

In [47]:
df_cluster_vars = df_copy[['danceability', 'speechiness', 'valence']].copy()
In [48]:
df_cluster_vars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   danceability  25190 non-null  float64
 1   speechiness   25190 non-null  float64
 2   valence       25190 non-null  float64
dtypes: float64(3)
memory usage: 1.3 MB

Preprocessing

The distributions of danceability and valence are approximately Gaussian and require no transformation. However, speechiness is right-skewed, so I applied a log transformation (with a small offset of 0.01 to avoid taking the log of zero) to reduce the influence of extreme values. There is no strong correlation among these three variables, which helps ensure that each variable contributes uniquely to the clustering process. None of the three variables has missing values.
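The effect of such a transformation can be quantified with sample skewness. A sketch on a toy right-skewed variable standing in for speechiness:

```python
import numpy as np
import pandas as pd

# Toy right-skewed variable (lognormal) standing in for speechiness.
x = pd.Series(np.random.default_rng(0).lognormal(mean=-2.5, sigma=1.0, size=5000))

# Pandas' sample skewness before and after the log transform; the small
# offset guards against log(0), matching the transformation applied below.
skew_before = x.skew()
skew_after = np.log(x + 0.01).skew()
print(skew_before, skew_after)
```

A large positive value before and a value near zero after confirms the transform pulls the distribution toward symmetry.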

In [49]:
df_cluster_vars['speechiness'] = np.log(df_cluster_vars.speechiness + 0.01)

df_cluster_vars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   danceability  25190 non-null  float64
 1   speechiness   25190 non-null  float64
 2   valence       25190 non-null  float64
dtypes: float64(3)
memory usage: 1.3 MB
In [50]:
sns.displot(data = df_copy, x='speechiness', kind='hist', kde=True)

plt.show()
No description has been provided for this image
In [51]:
sns.displot(data = df_cluster_vars, x='speechiness', kind='hist', kde=True)

plt.show()
No description has been provided for this image

Next, I'm going to check the scales of the three variables. The boxplot below shows that the magnitude and scale are dominated by speechiness.

In [52]:
sns.catplot(data=df_cluster_vars, kind='box', aspect=1.5)

plt.show()
No description has been provided for this image
In [53]:
from sklearn.preprocessing import StandardScaler
In [54]:
X = StandardScaler().fit_transform(df_cluster_vars) 
In [55]:
sns.catplot( data=pd.DataFrame(X, columns=df_cluster_vars.columns), kind='box', aspect=1.5)

plt.show()
No description has been provided for this image

Standardization is complete!

Since the selected features are not highly correlated, I chose to cluster on the original variables rather than apply dimensionality reduction with PCA. This preserves the original interpretability of the variables, which would be lost if transformed into principal components.
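The low-correlation claim is easy to verify numerically with `DataFrame.corr()`. Below is a minimal sketch on stand-in data (independent uniform draws, so near-zero correlations are expected; in practice `df_cluster_vars` would be substituted for the toy frame):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(121)
# stand-in for df_cluster_vars: three independent (hence weakly correlated) features
fake = pd.DataFrame(rng.uniform(size=(1000, 3)),
                    columns=['danceability', 'speechiness', 'valence'])

corr = fake.corr()
# largest off-diagonal magnitude; a high value would argue for PCA first
off_diag = corr.where(~np.eye(3, dtype=bool)).abs()
print(off_diag.max().max())
```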

As planned at the proposal stage, I used KMeans for clustering. While hierarchical clustering is an alternative, it becomes computationally impractical beyond roughly 15,000 observations, whereas my cleaned dataset contains 25,190. KMeans is therefore more appropriate for the size of my data.

In [56]:
from sklearn.cluster import KMeans
In [57]:
df_cluster_vars_copy = df_cluster_vars.copy()

Identifying the optimal number of clusters¶

In [58]:
tots_within = []

K = range(1, 31)

for k in K:
    km = KMeans(n_clusters=k, random_state=121, n_init=25, max_iter=500)
    km = km.fit( X )

    tots_within.append(km.inertia_)
In [59]:
fig, ax = plt.subplots()

ax.plot( K, tots_within, 'bo-' )
ax.set_xlabel('number of clusters')
ax.set_ylabel('total within sum of squares')

plt.show()
No description has been provided for this image

The elbow (knee bend) plot shows that the total within-cluster sum of squares begins to level off around 5 clusters. While the curve doesn't become completely flat beyond that point, the rate of decrease seems to slow significantly.
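Because the elbow is a visual judgment, a complementary check such as the silhouette score could corroborate the choice of 5. The sketch below runs it on synthetic, well-separated blobs rather than the Spotify features, so the "right" k is 5 by construction:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# five well-separated synthetic clusters in 3 features, so the true k is 5
centers = np.array([[0, 0, 0], [6, 0, 0], [0, 6, 0], [0, 0, 6], [6, 6, 6]])
X_demo, _ = make_blobs(n_samples=2000, centers=centers,
                       cluster_std=0.6, random_state=121)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=121).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(f"silhouette-optimal k: {best_k}")
```

On the real standardized features, the silhouette-optimal k may well differ from 5; the point is only that the metric gives a numeric criterion to set beside the elbow plot.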

Running KMeans for the optimal number of clusters¶

In [60]:
clusters_5 = KMeans(n_clusters=5, random_state=121, n_init=25, max_iter=500).fit_predict( X )
In [61]:
df_cluster_vars_copy['k5'] = pd.Series(clusters_5, index=df_cluster_vars_copy.index).astype('category')
In [62]:
df_cluster_vars_copy.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   danceability  25190 non-null  float64 
 1   speechiness   25190 non-null  float64 
 2   valence       25190 non-null  float64 
 3   k5            25190 non-null  category
dtypes: category(1), float64(3)
memory usage: 1.3 MB
In [63]:
df_cluster_vars_copy.k5.value_counts()
Out[63]:
k5
2    7132
0    5961
1    4725
4    4349
3    3023
Name: count, dtype: int64

Visualizing clustering results¶

In [64]:
sns.pairplot(data = df_cluster_vars_copy, hue='k5', diag_kws={'common_norm': False})

plt.show()
No description has been provided for this image

Cluster-wise Summary and Distribution of Danceability, Speechiness, and Valence¶

Danceability distribution across 5 clusters¶

In [65]:
sns.catplot(data=df_cluster_vars_copy, x='k5', y='danceability', kind='violin', hue='k5', aspect=1.5)

plt.show()
No description has been provided for this image

Cluster 4 includes the most danceable songs, showing the highest median and a relatively tight distribution, though it overlaps somewhat with Clusters 0 and 2, which fall into an intermediate-to-high danceability range. These two clusters share a similar range and overlap with Cluster 4 at the high end and with Cluster 3 at the lower end.

Cluster 1 stands out as the least danceable group, with the lowest median and minimal overlap with the more danceable Cluster 3. Cluster 3 bridges the gap between the intermediate and low-danceability groups, overlapping significantly with Clusters 0 and 2, and only slightly with Cluster 1.

While the clusters are not perfectly distinct, the contrast between the high median in Cluster 4 and the low median in Cluster 1 suggests that danceability meaningfully separates the most and least danceable groups. Additionally, the long lower tails seen in Clusters 1, 2, 3, and 4 indicate greater variability and some skewness in danceability values, especially toward the lower end of the scale.

Speechiness distribution across 5 clusters¶

In [66]:
sns.catplot(data=df_cluster_vars_copy, x='k5', y='speechiness', kind='violin', hue='k5', aspect=1.5)

plt.show()
No description has been provided for this image

Clusters 0, 1, and 2 share similarly low median values, closely aligned distribution shapes, and fully overlapping violins, forming a clear low-speechiness group. In contrast, Clusters 3 and 4 exhibit higher, comparable medians and substantial overlap, suggesting a high-speechiness group.

Notably, the two groups - Clusters 0, 1, and 2 and Clusters 3 and 4 - show no overlap in their interquartile ranges, indicating a clear separation between low and high speechiness clusters.

Cluster 1 stands out slightly within the low-speechiness group due to its long tails on both ends, indicating greater variability in speechiness values.

Valence distribution across 5 clusters¶

In [67]:
sns.catplot(data=df_cluster_vars_copy, x='k5', y='valence', kind='violin', hue='k5', aspect=1.5)

plt.show()
No description has been provided for this image

Clusters 0, 1, and 3 have similar medians and substantially overlapping interquartile ranges, forming a low-valence group - that is, clusters with songs that tend to sound more negative.

In contrast, Clusters 2 and 4 have higher median values and significantly overlapping interquartile ranges, indicating a higher-valence group associated with more positive-sounding songs.

The slight overlap between Clusters 3 and 4 suggests that Cluster 3 shares some valence characteristics with the higher-valence group, despite generally aligning with lower-valence clusters. Additionally, the long upward tails of Clusters 1 and 3 and the long downward tail of Cluster 4 reflect greater variability in valence values at the distribution extremes.

In summary, Clusters 0, 1, and 2 share low speechiness, with similar medians and overlapping distributions, while Clusters 3 and 4 form a distinct high-speechiness group. For danceability, Cluster 4 is the most danceable, Cluster 1 the least, and the others fall in between with overlapping ranges. Valence separates into a low group (Clusters 0, 1, and 3) and a high group (Clusters 2 and 4), with slight overlap between Clusters 3 and 4. Cluster 4 consistently stands out with high values across all three features, while Cluster 1 is low on all three and shows the greatest variability.
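The qualitative profiles above can also be produced numerically with a groupby of per-cluster medians. The sketch below uses a toy frame shaped like `df_cluster_vars_copy` (random values, illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(121)
# toy stand-in for df_cluster_vars_copy: three features plus a cluster label
toy = pd.DataFrame({
    'danceability': rng.uniform(size=500),
    'speechiness': rng.uniform(size=500),
    'valence': rng.uniform(size=500),
    'k5': pd.Categorical(rng.integers(0, 5, size=500)),
})

# per-cluster medians: the numeric counterpart of the violin plots above
cluster_medians = toy.groupby('k5', observed=True)[
    ['danceability', 'speechiness', 'valence']].median()
print(cluster_medians)
```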

Comparing the cluster assignments to unique values of categorical inputs¶

In [68]:
df_cluster_vars_copy['playlist_genre'] = df_copy.playlist_genre
df_cluster_vars_copy['playlist_subgenre'] = df_copy.playlist_subgenre
df_cluster_vars_copy['key'] = df_copy.key
df_cluster_vars_copy['mode'] = df_copy['mode']

playlist_genre vs k5¶

In [69]:
fig, ax = plt.subplots()

sns.heatmap(data = pd.crosstab( df_cluster_vars_copy.playlist_genre, df_cluster_vars_copy.k5, margins=True ), 
            annot=True, annot_kws={"fontsize": 15}, fmt='g',
            cbar=True,
            ax=ax)

plt.show()
No description has been provided for this image
In [70]:
sns.catplot(data = df_cluster_vars_copy, x='playlist_genre', hue='k5', kind='count')
plt.show()
No description has been provided for this image

The heatmap and the dodged bar chart above show that while all clusters include a mix of genres, certain clusters have stronger associations with specific genres. For example, Cluster 1 is heavily represented by rock tracks, and Cluster 4 has the highest concentration of rap songs. Cluster 2, the largest group, appears relatively balanced across genres, while Cluster 0 leans more toward EDM and pop. Cluster 3 has a moderate concentration of rap and r&b. Overall, genre does not strictly define clusters, but some genre-based patterns are noticeable.

key vs k5¶

In [71]:
fig, ax = plt.subplots()

sns.heatmap(data = pd.crosstab( df_cluster_vars_copy.key, df_cluster_vars_copy.k5, margins=True ), 
            annot=True, annot_kws={"fontsize": 13}, fmt='g',
            cbar=True,
            ax=ax)

plt.show()
No description has been provided for this image
In [72]:
sns.catplot(data = df_cluster_vars_copy, x='key', hue='k5', kind='count')
plt.show()
No description has been provided for this image

Similarly to the previous plots, the heatmap and the dodged bar chart above show that all clusters contain songs across a variety of musical keys. While some keys appear slightly more frequently, there is no strong or exclusive association between any particular key and any cluster. This suggests that musical key is not a major factor driving the clustering structure.

mode vs k5¶

In [73]:
fig, ax = plt.subplots()

sns.heatmap(data = pd.crosstab( df_cluster_vars_copy['mode'], df_cluster_vars_copy.k5, margins=True ), 
            annot=True, annot_kws={"fontsize": 13}, fmt='g',
            cbar=True,
            ax=ax)

plt.show()
No description has been provided for this image
In [74]:
sns.catplot(data = df_cluster_vars_copy, x='mode', hue='k5', kind='count')
plt.show()
No description has been provided for this image

Similarly to the other visualizations, the heatmap and the dodged bar chart above show that all clusters contain a mix of both major and minor songs. Clusters 3 and 4 have a relatively balanced distribution of modes, while Clusters 0, 1, and 2 show a slight dominance of major songs (mode = 1). This suggests that modality is not a key factor influencing the clustering structure.

Comparing the cluster assignments to unique values of the outcome¶

In [75]:
df_cluster_vars_copy['binary_outcome'] = df_copy.binary_outcome
In [76]:
fig, ax = plt.subplots()

sns.heatmap(data = pd.crosstab( df_cluster_vars_copy.binary_outcome, df_cluster_vars_copy.k5, margins=True ), 
            annot=True, annot_kws={"fontsize": 15}, fmt='g',
            cbar=True,
            ax=ax)

plt.show()
No description has been provided for this image
In [77]:
sns.catplot(data = df_cluster_vars_copy, x='binary_outcome', hue='k5', kind='count')
plt.show()
No description has been provided for this image

The heatmap and the dodged bar chart above show that all clusters contain a roughly similar proportion of songs with popularity above and below 50. This suggests that the binary popularity outcome is not a major factor driving the clustering structure.
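"Roughly similar proportion" is easiest to judge from a row-normalized crosstab, where each cluster's row shows its popular share directly. A sketch on synthetic labels using the dataset's ~33% base rate (not the actual cluster assignments):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(121)
k5 = pd.Series(rng.integers(0, 5, size=2000), name='k5')
# ~33% popular, matching the dataset's base rate
outcome = pd.Series(rng.binomial(1, 0.33, size=2000), name='binary_outcome')

# normalize='index' turns each cluster's row into proportions summing to 1
share = pd.crosstab(k5, outcome, normalize='index')
print(share.round(3))
```

Applied to `df_cluster_vars_copy`, clusters whose rows all hover near 0.33/0.67 would numerically confirm that cluster membership carries little information about popularity.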

In summary, the clustering grouped songs into distinct profiles based on their danceability, valence, and speechiness. The key finding is that these clusters did not separate popular from unpopular songs, which reinforces our EDA's conclusion that no single audio feature is strongly predictive on its own. The analysis also revealed noticeable genre-based patterns within the clusters, supporting the EDA's finding that audio features differ across genres and suggesting this relationship is important to explore in the modeling stage.

C. Models: Fitting and Interpretation¶

Preprocessing¶

In Section A (EDA), we saw that the distributions of the continuous input variables loudness, speechiness, instrumentalness, and liveness are not Gaussian-like. To address this, I applied a natural log transformation to speechiness, instrumentalness, and liveness. Because loudness includes negative values, neither the natural log nor the square root transformation is applicable. Instead, I applied a cube root transformation to loudness, which works well with negative values and helps reduce skew.

Although the distribution of tempo shows multiple peaks and is not perfectly symmetric, it is not strongly skewed. Therefore, I used it in its original form. The remaining continuous input variables were already approximately Gaussian-like and did not require transformation.
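A small demonstration of why the cube root suits loudness: unlike `np.log` or `np.sqrt`, `np.cbrt` is defined for negative inputs and preserves both sign and ordering while compressing magnitude (the values below are illustrative, loudness-like dB figures, not rows from the dataset):

```python
import numpy as np

# illustrative loudness-like values in dB (mostly negative, as in the dataset)
loud = np.array([-46.4, -8.4, -6.3, -4.8, 1.3])

# cbrt handles negatives, keeping the sign and the original ordering
print(np.cbrt(loud))
print(np.cbrt(-8.0))  # a perfect cube: close to -2
```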

In [78]:
numeric_vars = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']

df_numeric_inputs = df_copy[numeric_vars].copy()
In [79]:
df_numeric_inputs.describe()
Out[79]:
danceability energy loudness speechiness acousticness instrumentalness liveness valence tempo duration_ms
count 25190.000000 25190.000000 25190.000000 25190.000000 25190.000000 25190.000000 25190.000000 25190.000000 25190.000000 25190.000000
mean 0.652559 0.697178 -6.896591 0.108713 0.179709 0.095998 0.191432 0.510365 120.981311 226843.724732
std 0.146274 0.185624 3.068494 0.103691 0.225940 0.238703 0.157258 0.235183 26.994157 61983.810649
min 0.000000 0.000175 -46.448000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4000.000000
25% 0.560000 0.577000 -8.412000 0.041000 0.014100 0.000000 0.092500 0.328000 99.976000 187539.500000
50% 0.669000 0.721000 -6.345500 0.062700 0.080200 0.000025 0.127000 0.512000 121.994000 217200.000000
75% 0.760000 0.844000 -4.764000 0.135000 0.265000 0.008108 0.249000 0.695000 134.057000 255413.000000
max 0.983000 1.000000 1.275000 0.918000 0.994000 0.994000 0.996000 0.991000 239.440000 517810.000000
In [80]:
df_numeric_inputs['loudness'] = np.cbrt(df_numeric_inputs.loudness)
df_numeric_inputs['speechiness'] = np.log(df_numeric_inputs.speechiness + 0.01)
df_numeric_inputs['acousticness'] = np.log(df_numeric_inputs.acousticness + 0.01)
df_numeric_inputs['instrumentalness'] = np.log(df_numeric_inputs.instrumentalness + 0.01)
df_numeric_inputs['liveness'] = np.log(df_numeric_inputs.liveness + 0.01)
In [81]:
df_numeric_inputs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      25190 non-null  float64
 1   energy            25190 non-null  float64
 2   loudness          25190 non-null  float64
 3   speechiness       25190 non-null  float64
 4   acousticness      25190 non-null  float64
 5   instrumentalness  25190 non-null  float64
 6   liveness          25190 non-null  float64
 7   valence           25190 non-null  float64
 8   tempo             25190 non-null  float64
 9   duration_ms       25190 non-null  int64  
dtypes: float64(9), int64(1)
memory usage: 2.6 MB
In [82]:
sns.catplot( data=df_numeric_inputs, kind='box', aspect=2)

plt.show()
No description has been provided for this image

The boxplot above shows that the magnitude and scale are dominated by duration_ms.

In [83]:
df_numeric_scaled = pd.DataFrame(StandardScaler().fit_transform(df_numeric_inputs),
                                 columns=df_numeric_inputs.columns, 
                                 index=df_numeric_inputs.index)
In [84]:
sns.catplot(data=df_numeric_scaled, kind='box', aspect=2)

plt.show()
No description has been provided for this image

Standardization is complete!

Creating a dataset with preprocessed continuous input variables, categorical input variables, and binary_outcome¶

In [85]:
df_modeling = df_numeric_scaled.copy()
df_modeling['playlist_genre'] = df_copy.playlist_genre
df_modeling['key'] = df_copy.key
df_modeling['mode'] = df_copy['mode']
df_modeling['binary_outcome'] = df_copy.binary_outcome.astype(int)
In [86]:
df_modeling.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      25190 non-null  float64
 1   energy            25190 non-null  float64
 2   loudness          25190 non-null  float64
 3   speechiness       25190 non-null  float64
 4   acousticness      25190 non-null  float64
 5   instrumentalness  25190 non-null  float64
 6   liveness          25190 non-null  float64
 7   valence           25190 non-null  float64
 8   tempo             25190 non-null  float64
 9   duration_ms       25190 non-null  float64
 10  playlist_genre    25190 non-null  object 
 11  key               25190 non-null  object 
 12  mode              25190 non-null  object 
 13  binary_outcome    25190 non-null  int32  
dtypes: float64(10), int32(1), object(3)
memory usage: 3.3+ MB

Defining a list of formulas for eight models¶

I will fit and evaluate eight distinct logistic regression models to predict song popularity. This set includes six standard models required by the project instructions, progressing from a simple baseline to complex interaction models.

In addition, I will test two custom models. The first (Model 7) is directly motivated by my EDA, which revealed that the distributions of key audio features like danceability and energy vary significantly across playlist genres. This model tests the hypothesis that the "recipe" for a popular song is different for each genre.

The second (Model 8) is an exploratory model developed after finding no clear linear or U-shaped patterns in my initial scatter plots. This model tests a different non-linear hypothesis: that the relationship between a song's popularity and its tempo is not linear, but cyclical. This theory explores the idea that certain BPM ranges, like those ideal for activities such as dancing or driving, could be more popular than others.

Thus, my list includes the following models:

  • Model 1: intercept-only or constant average model
  • Model 2: categorical inputs with additive features
  • Model 3: continuous inputs with linear additive features
  • Model 4: all inputs (continuous and categorical) with linear additive features
  • Model 5: continuous inputs with linear main effect and pair-wise interactions
  • Model 6: interaction between categorical and continuous inputs (including main effects)
  • Model 7: main effects plus the interaction between playlist_genre and a targeted subset of features
  • Model 8: cyclical (sine) tempo feature interacting with playlist_genre, plus main effects of other key audio features
In [87]:
formula_list = ['binary_outcome ~ 1', # intercept-only model
                'binary_outcome ~ playlist_genre + key + mode', # categorical inputs with additive features
                'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms', # continuous inputs with linear additive features
                'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode', # all inputs (continuous and categorical) with linear additive features
                'binary_outcome ~ (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)**2', # continuous inputs with linear main effect and pair-wise interactions
                'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)', # interaction between categorical and continuous inputs (including main effects)
                'binary_outcome ~ playlist_genre * (danceability + energy + speechiness + instrumentalness + valence + tempo)', # interaction between playlist_genre and a targeted subset of continuous features
                'binary_outcome ~ playlist_genre * np.sin(tempo) + danceability + energy + acousticness'] # cyclical (sine) tempo feature interacting with playlist_genre, plus main effects of other key audio features
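One detail worth noting about Model 8's formula: the `np.sin(tempo)` term is evaluated by patsy at fit time, so no precomputed sine column is needed. A minimal sketch on random stand-in data (not the project's `df_modeling`) shows the mechanism and the resulting coefficient name:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(121)
# random stand-in: a standardized-tempo-like column plus a noise outcome
demo = pd.DataFrame({'tempo': rng.normal(size=500)})
demo['binary_outcome'] = rng.binomial(1, 0.33, size=500)

# patsy evaluates np.sin(tempo) while building the design matrix
fit = smf.logit('binary_outcome ~ np.sin(tempo)', data=demo).fit(disp=0)
print(fit.params.index.tolist())
```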

Fitting the models, checking their coefficients, and showing the performance on the training set¶

In [88]:
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

I will define a single function to streamline the analysis. For any given model, this function will automatically fit the model, analyze its coefficients (number, significance, and magnitude), and calculate all required performance metrics (Accuracy, Sensitivity, Precision, Specificity, FPR, F1 score, and ROC AUC).

In [89]:
def fit_and_analyze_logistic(mod_name, a_formula, train_data, threshold=0.5):
    a_mod = smf.logit(formula=a_formula, data=train_data).fit()

    params = a_mod.params
    pvalues = a_mod.pvalues
    significant_params = params[pvalues < 0.05]

    top_2_features = []
    if len(significant_params) > 0:
        # rank by squared magnitude; keep the 3 largest so the top 2
        # non-intercept terms survive even when the intercept dominates
        top_2_features = (significant_params ** 2).sort_values(ascending=False).head(3).index.tolist()
    
    train_copy = train_data.copy()
    train_copy['pred_probability'] = a_mod.predict( train_data )
    
    train_copy['pred_class'] = np.where( train_copy.pred_probability > threshold, 1, 0 )

    TN, FP, FN, TP = confusion_matrix( train_copy.binary_outcome.to_numpy(), train_copy.pred_class.to_numpy() ).ravel()
    
    Accuracy = (TN + TP) / (TN + FP + FN + TP)
    Sensitivity = (TP) / (TP + FN)
    Precision = (TP) / (TP + FP) if (TP + FP) > 0 else 0
    Specificity = (TN) / (TN + FP)
    FPR = 1 - Specificity
    F1_Score = 2 * (Precision * Sensitivity) / (Precision + Sensitivity) if (Precision + Sensitivity) > 0 else 0
    ROC_AUC = roc_auc_score( train_copy.binary_outcome.to_numpy(), train_copy.pred_probability.to_numpy() )

    res_dict = {'model_name': mod_name,
                'model_formula': a_formula,
                'num_coefs': len(a_mod.params),
                'num_sign_coefs': len(significant_params),
                'sign_coefs_&_their_values': [significant_params.to_dict()],
                'top_2_features': [top_2_features],
                'Accuracy': Accuracy,
                'Sensitivity': Sensitivity,
                'Precision': Precision,
                'Specificity': Specificity,
                'FPR': FPR,
                'F1 Score': F1_Score,
                'ROC_AUC': ROC_AUC}

    return pd.DataFrame(res_dict, index=[0])

Model 1. Intercept-only or constant average model¶

In [90]:
formula_list[0]
Out[90]:
'binary_outcome ~ 1'
In [91]:
fit_glm_1 = smf.logit(formula=formula_list[0], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.632687
         Iterations 4
In [92]:
fit_glm_1.params
Out[92]:
Intercept   -0.717663
dtype: float64
In [93]:
df_modeling_copy = df_modeling.copy()
In [94]:
df_modeling_copy['pred_probability_M1'] = fit_glm_1.predict(df_modeling)
In [95]:
df_modeling_copy['pred_class_M1'] = np.where(df_modeling_copy.pred_probability_M1 > 0.5, 1, 0)
In [96]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M1, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
No description has been provided for this image

As we can see from the confusion matrix above, the model doesn't classify any songs as "popular".

In [97]:
model_1_results = fit_and_analyze_logistic(mod_name='Model 1', a_formula=formula_list[0], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.632687
         Iterations 4
In [98]:
model_1_results
Out[98]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 1 binary_outcome ~ 1 1 1 {'Intercept': -0.717662608612157} [Intercept] 0.672092 0.0 0 1.0 0.0 0 0.5
In [99]:
df_modeling_copy.pred_probability_M1.value_counts()
Out[99]:
pred_probability_M1
0.327908    25190
Name: count, dtype: int64
In [100]:
df_modeling.binary_outcome.value_counts(normalize=True)
Out[100]:
binary_outcome
0    0.672092
1    0.327908
Name: proportion, dtype: float64

This model serves as a simple baseline. With no predictor variables, it calculates the average likelihood of a song being popular across the entire dataset.

The model's single coefficient, the intercept (approximately -0.72), is statistically significant. This model predicts the same probability of being popular - approximately 33% - for every song. Because this constant predicted probability of 33% is below the 0.5 decision threshold, the model logically classifies every song as "not popular". This directly explains its performance metrics.

  • The accuracy is approximately 0.67, which simply reflects the class imbalance in the dataset; 67% of all songs belong to the "not popular" (majority) class. The model gets them all right by default.
  • The sensitivity is 0. Because the model never predicts a song as popular, it fails to correctly identify any of the truly popular songs.
  • The precision is also 0. Since the model makes no positive predictions, the number of True Positives is zero, resulting in a precision score of 0.
  • The specificity is a perfect 1, and therefore the FPR is 0. This is because the model correctly classifies every truly unpopular song as "not popular".
  • The F1 Score is 0, which is expected as it balances precision and sensitivity. A score of 0 indicates a complete failure to identify the positive class.
  • The ROC AUC is 0.5, which confirms the model has no ability to distinguish between popular and unpopular songs.
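The ~33% figure follows directly from the intercept through the inverse-logit (logistic) function; a quick worked check:

```python
import numpy as np

intercept = -0.717663  # Model 1's single coefficient, on the log-odds scale

# the inverse-logit (logistic) function maps log-odds to a probability
prob = 1.0 / (1.0 + np.exp(-intercept))
print(round(prob, 6))  # ≈ 0.327908, the constant predicted probability above
```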

Model 2. Categorical inputs with additive features¶

In [101]:
formula_list[1]
Out[101]:
'binary_outcome ~ playlist_genre + key + mode'
In [102]:
fit_glm_2 = smf.logit(formula=formula_list[1], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.615117
         Iterations 6
In [103]:
df_modeling_copy['pred_probability_M2'] = fit_glm_2.predict(df_modeling)
df_modeling_copy['pred_class_M2'] = np.where(df_modeling_copy.pred_probability_M2 > 0.5, 1, 0)
In [104]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M2, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
No description has been provided for this image

Similarly to the intercept-only model, Model 2 does not classify any songs as "popular".

In [105]:
model_2_results = fit_and_analyze_logistic(mod_name='Model 2', a_formula=formula_list[1], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.615117
         Iterations 6
In [106]:
model_2_results
Out[106]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 2 binary_outcome ~ playlist_genre + key + mode 18 7 {'Intercept': -1.633228568540369, 'playlist_ge... [Intercept, playlist_genre[T.pop], playlist_ge... 0.672092 0.0 0 1.0 0.0 0 0.599069

This model, which includes categorical inputs playlist_genre, key, and mode as predictors, has 18 coefficients, with 7 being statistically significant. This indicates that we can be confident that these 7 features have a non-zero effect on a song's popularity and that we can trust the direction of that relationship as indicated by the coefficients' signs. Below is the full list of all statistically significant features and their values:

In [107]:
model_2_results['sign_coefs_&_their_values'].iloc[0]
Out[107]:
{'Intercept': -1.633228568540369,
 'playlist_genre[T.latin]': 1.0789336718505969,
 'playlist_genre[T.pop]': 1.2322868891076435,
 'playlist_genre[T.r&b]': 0.8023634295139332,
 'playlist_genre[T.rap]': 1.1026573373088557,
 'playlist_genre[T.rock]': 1.2030982403071422,
 'key[T.7]': -0.12411197465085606}
In [108]:
model_2_results.top_2_features[0]
Out[108]:
['Intercept', 'playlist_genre[T.pop]', 'playlist_genre[T.rock]']

The two features with the most impactful coefficients are the pop (around 1.23) and rock (around 1.20) genres. These are part of a broader pattern where five genres in total - including rap, latin, and r&b - all have a statistically significant positive association with popularity compared to the edm baseline. This provides strong evidence that a song's genre has a non-zero effect on its popularity.
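Because logit coefficients are on the log-odds scale, exponentiating them yields odds ratios relative to the edm baseline, which are often easier to interpret. A sketch using the significant coefficients reported above (values copied and rounded from the output):

```python
import numpy as np
import pandas as pd

# significant Model 2 coefficients, copied (rounded) from the output above
coefs = pd.Series({
    'playlist_genre[T.pop]': 1.2323,
    'playlist_genre[T.rock]': 1.2031,
    'playlist_genre[T.rap]': 1.1027,
    'playlist_genre[T.latin]': 1.0789,
    'playlist_genre[T.r&b]': 0.8024,
    'key[T.7]': -0.1241,
})

# exp(coef) = multiplicative change in the odds of being popular vs. the edm baseline
odds_ratios = np.exp(coefs).round(2)
print(odds_ratios)
```

For example, an odds ratio of about 3.4 for pop means a pop track's odds of being popular are roughly 3.4 times those of an otherwise-identical edm track, while key 7's ratio below 1 indicates slightly reduced odds.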

In [109]:
df_modeling_copy.pred_probability_M2.describe()
Out[109]:
count    25190.000000
mean         0.327908
std          0.084563
min          0.147124
25%          0.299322
50%          0.361415
75%          0.386307
max          0.432853
Name: pred_probability_M2, dtype: float64

The highest predicted probability for any song is approximately 0.43. Because no prediction crosses the 0.5 decision threshold, this model still classifies all songs as "not popular," just like the intercept-only baseline. Consequently, all performance metrics that depend on the final class predictions - Accuracy, Sensitivity, Precision, and F1 Score - are identical to Model 1. For example, the Sensitivity is 0 because no popular songs are correctly identified.

The key improvement is the ROC AUC, which increased from the baseline of 0.5 to roughly 0.60. This shows the model now has a weak but real ability to distinguish between the classes. Although its predicted probabilities don't cross the 0.5 threshold, it has learned to give slightly higher scores to popular songs than to unpopular ones.

Model 3. Continuous inputs with linear additive features¶

In [110]:
formula_list[2]
Out[110]:
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms'
In [111]:
fit_glm_3 = smf.logit(formula=formula_list[2], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.613043
         Iterations 5
In [112]:
df_modeling_copy['pred_probability_M3'] = fit_glm_3.predict(df_modeling)
df_modeling_copy['pred_class_M3'] = np.where(df_modeling_copy.pred_probability_M3 > 0.5, 1, 0)
In [113]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M3, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
[Figure: confusion-matrix heatmap of binary_outcome vs pred_class_M3]

Unlike the previous models, Model 3 correctly classifies some songs as popular, though the number of songs it predicts as popular is much smaller than the total number of truly popular songs.

In [114]:
model_3_results = fit_and_analyze_logistic(mod_name='Model 3', a_formula=formula_list[2], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.613043
         Iterations 5
In [115]:
model_3_results
Out[115]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 3 binary_outcome ~ danceability + energy + loudn... 11 10 {'Intercept': -0.7548636863231273, 'energy': -... [Intercept, energy, loudness] 0.670822 0.027966 0.467611 0.984465 0.015535 0.052776 0.617639
In [116]:
df_modeling_copy.pred_probability_M3.describe()
Out[116]:
count    25190.000000
mean         0.327908
std          0.090986
min          0.054246
25%          0.270384
50%          0.335194
75%          0.390342
max          0.873626
Name: pred_probability_M3, dtype: float64

This model, which includes continuous inputs with linear additive features as predictors, has 11 coefficients, with 10 being statistically significant. This indicates that we can be confident that these 10 audio features have a non-zero effect on a song's popularity and that we can trust the direction of that relationship as indicated by the coefficients' signs. Below is the full list of all statistically significant features and their values:

In [117]:
model_3_results['sign_coefs_&_their_values'].iloc[0]
Out[117]:
{'Intercept': -0.7548636863231273,
 'energy': -0.3011426478648988,
 'loudness': 0.25856875376431365,
 'speechiness': -0.0738259110375911,
 'acousticness': 0.0976045841097421,
 'instrumentalness': -0.23343668704410966,
 'liveness': -0.03467033771999454,
 'valence': 0.04572820472766655,
 'tempo': 0.07046847805346242,
 'duration_ms': -0.15488234284077587}

The two coefficients with the highest magnitude apart from the intercept (around -0.75) are energy (around -0.30) and loudness (around 0.26). This indicates that, after controlling for other factors, an increase in a song's loudness is associated with a higher likelihood of it being popular, while a corresponding increase in energy is associated with a lower likelihood. It's important to note that energy and loudness are moderately correlated (0.68). While this does not impact the model's overall predictive power, the individual coefficients for these two features should be interpreted with caution, as the model may have difficulty separating their unique effects.
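The collinearity check behind the 0.68 figure is a one-line pairwise correlation. A minimal sketch on simulated stand-in columns (to avoid re-downloading the CSV here); on the real data the same call would be `df_modeling[['energy', 'loudness']].corr()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-ins for the real columns: loudness tracks energy plus noise,
# mimicking the ~0.68 correlation observed in the Spotify data
energy = rng.normal(size=5000)
loudness = 0.7 * energy + 0.7 * rng.normal(size=5000)

demo = pd.DataFrame({'energy': energy, 'loudness': loudness})
print(demo.corr().round(2))
```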

A key improvement is that the model's predicted probabilities now reach as high as 0.87, so unlike the previous models, it classifies some songs as "popular." The model's performance reveals a clear trade-off: its primary strength is very high specificity (around 0.98), meaning it is excellent at correctly identifying unpopular songs. This results in a low FPR of just 1.6%: the model incorrectly flags only 1.6% of the truly "unpopular" songs as "popular".

However, this high specificity comes at the cost of extremely low sensitivity (around 0.03): the model finds only about 3% of all truly popular songs. This poor performance on the positive class is also reflected in the Precision (around 0.47), which shows the model is correct less than half the time it predicts "popular," and in the very low F1 Score (around 0.05).

While the accuracy (around 0.67) is misleading due to class imbalance, the ROC AUC of 0.62 is the most reliable indicator. It confirms this model has a modest but improved ability to distinguish between classes compared to the baseline and Model 2.

Model 4. All inputs (continuous and categorical) with linear additive features¶

In [118]:
formula_list[3]
Out[118]:
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode'
In [119]:
fit_glm_4 = smf.logit(formula=formula_list[3], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.598450
         Iterations 6
In [120]:
df_modeling_copy['pred_probability_M4'] = fit_glm_4.predict(df_modeling)
df_modeling_copy['pred_class_M4'] = np.where(df_modeling_copy.pred_probability_M4 > 0.5, 1, 0)
In [121]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M4, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
[Figure: confusion-matrix heatmap of binary_outcome vs pred_class_M4]

Similar to Model 3, Model 4 correctly classifies some songs as popular. However, it still fails to identify a large portion of the truly popular songs.

In [122]:
model_4_results = fit_and_analyze_logistic(mod_name='Model 4', a_formula=formula_list[3], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.598450
         Iterations 6
In [123]:
model_4_results
Out[123]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 4 binary_outcome ~ danceability + energy + loudn... 28 15 {'Intercept': -1.5917205139340989, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.669075 0.084988 0.474324 0.954046 0.045954 0.144148 0.656647
In [124]:
df_modeling_copy.pred_probability_M4.describe()
Out[124]:
count    25190.000000
mean         0.327908
std          0.119032
min          0.031741
25%          0.241488
50%          0.339460
75%          0.414837
max          0.950219
Name: pred_probability_M4, dtype: float64

This model, which includes all inputs with linear additive features, has 28 coefficients, with 15 being statistically significant. Below is the full list of all statistically significant features and their values:

In [125]:
model_4_results['sign_coefs_&_their_values'].iloc[0]
Out[125]:
{'Intercept': -1.5917205139340989,
 'playlist_genre[T.latin]': 0.8710920712060213,
 'playlist_genre[T.pop]': 1.0828670623658054,
 'playlist_genre[T.r&b]': 0.6345725938683296,
 'playlist_genre[T.rap]': 0.8857224431014801,
 'playlist_genre[T.rock]': 1.4025550042702633,
 'danceability': 0.11759980055645763,
 'energy': -0.2951494814511126,
 'loudness': 0.33056018693286277,
 'speechiness': -0.047355152773171635,
 'acousticness': 0.10518289624176624,
 'instrumentalness': -0.1609632391131655,
 'valence': -0.0459921965724886,
 'tempo': 0.08061844635555175,
 'duration_ms': -0.16591935884430412}
In [126]:
model_4_results.top_2_features[0]
Out[126]:
['Intercept', 'playlist_genre[T.rock]', 'playlist_genre[T.pop]']

Among the model's features, the two with the highest magnitude coefficients are both genres: playlist_genre[T.rock] (around 1.40) and playlist_genre[T.pop] (around 1.08). This shows that even after accounting for a song's specific audio features, its genre has a strong association with its popularity. Compared to the edm baseline genre, songs in the rock and pop categories are significantly more likely to be popular, with rock showing the strongest effect.

We can also see some effects from continuous features. For example, higher loudness (around 0.33) is associated with an increased likelihood of being popular, while higher energy (around -0.30) is associated with a decreased likelihood. This helps clarify what a "popular" song in this dataset tends to look like. It is important, however, to interpret the individual strength of these two effects with caution, as energy and loudness are moderately correlated in the data.

This model shows a clear, though modest, improvement in performance over previous models. The ROC AUC increased to around 0.66, indicating better overall ability to distinguish between classes.

The most significant improvement is in the model's ability to find positive cases. The sensitivity increased to 0.085 and the F1 Score rose to 0.14. While still low, their values are higher than in Model 3. At the same time, this improvement comes with a trade-off, as specificity dropped to around 0.95, which means the model makes slightly more false positive errors to find more true positives. The accuracy (around 0.67) remains a misleading metric due to the class imbalance.
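The sensitivity/specificity trade-off described above is governed by the 0.5 decision threshold, and lowering it would trade specificity for sensitivity. A small sketch on simulated predicted probabilities (a synthetic stand-in for the model's output, with positives scored slightly higher on average, as in the real fit):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: ~33% positives, positives score ~0.08 higher on average
y_true = np.concatenate([np.zeros(670), np.ones(330)])
p_hat = np.clip(rng.normal(0.30 + 0.08 * y_true, 0.12), 0, 1)

for threshold in (0.5, 0.35):
    y_pred = (p_hat > threshold).astype(int)
    sens = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
    spec = ((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()
    print(f'threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}')
```

Lowering the threshold raises sensitivity at the cost of specificity, which is exactly the trade-off seen when moving from Model 3 to Model 4.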

Model 5. Continuous inputs with linear main effect and pair-wise interactions¶

In [127]:
formula_list[4]
Out[127]:
'binary_outcome ~ (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)**2'
In [128]:
fit_glm_5 = smf.logit(formula=formula_list[4], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.604714
         Iterations 6
In [129]:
df_modeling_copy['pred_probability_M5'] = fit_glm_5.predict(df_modeling)
df_modeling_copy['pred_class_M5'] = np.where(df_modeling_copy.pred_probability_M5 > 0.5, 1, 0)
In [130]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M5, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
[Figure: confusion-matrix heatmap of binary_outcome vs pred_class_M5]
In [131]:
model_5_results = fit_and_analyze_logistic(mod_name='Model 5', a_formula=formula_list[4], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.604714
         Iterations 6
In [132]:
model_5_results
Out[132]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 5 binary_outcome ~ (danceability + energy + loud... 56 27 {'Intercept': -0.7532051680487379, 'energy': -... [Intercept, energy, loudness] 0.675347 0.065375 0.541082 0.972947 0.027053 0.116656 0.637941

This model adds significant complexity by including all possible two-way interactions between the continuous features. It has 56 coefficients, with 27 being statistically significant. Below is the full list of all statistically significant features and their values:

In [133]:
model_5_results['sign_coefs_&_their_values'].iloc[0]
Out[133]:
{'Intercept': -0.7532051680487379,
 'energy': -0.3106834076116351,
 'loudness': 0.2793634428566775,
 'speechiness': -0.09220691713082607,
 'acousticness': 0.09373444081455799,
 'instrumentalness': -0.2620295078594359,
 'liveness': -0.033590554154820436,
 'valence': 0.03973166789264538,
 'tempo': 0.051175350580958866,
 'duration_ms': -0.14365658029877337,
 'danceability:speechiness': 0.05272747462445708,
 'danceability:acousticness': 0.07878215297136672,
 'danceability:instrumentalness': -0.06562398956563534,
 'danceability:liveness': 0.035630246008381584,
 'danceability:duration_ms': -0.06043953932417578,
 'energy:loudness': -0.1234989177067231,
 'energy:speechiness': -0.05519997122774107,
 'energy:valence': 0.04692717948986716,
 'loudness:acousticness': -0.06783866101599377,
 'loudness:instrumentalness': -0.10289914832890394,
 'loudness:valence': 0.06118481775443594,
 'speechiness:valence': -0.050890582008801595,
 'speechiness:tempo': 0.03383822265464645,
 'speechiness:duration_ms': -0.033908687164200556,
 'acousticness:valence': -0.05890538265405652,
 'acousticness:tempo': 0.036702183815766,
 'acousticness:duration_ms': 0.04642356234542932}

The two statistically significant features with the highest magnitude are energy (around -0.31) and loudness (around 0.28). The key new finding is the presence of numerous significant interaction terms. For example, the significant negative interaction between energy and loudness suggests that the relationship is not simple; the positive effect of a song being loud may be reduced if the song is also very energetic. As in Models 3 and 4, it is important here to interpret the individual effects of energy and loudness with caution, as they are moderately correlated.
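Because of the energy:loudness interaction, loudness no longer has a single slope: its marginal (log-odds) effect is b_loudness + b_energy:loudness * energy. A small sketch using the Model 5 coefficients above (this assumes the inputs were standardized, so energy = -1, 0, 1 means one standard deviation below, at, and above the mean):

```python
# Coefficients taken from the Model 5 output above (rounded to four decimals)
b_loudness = 0.2794
b_energy_loudness = -0.1235

# d(logit)/d(loudness) = b_loudness + b_energy_loudness * energy
for energy in (-1.0, 0.0, 1.0):
    slope = b_loudness + b_energy_loudness * energy
    print(f'energy = {energy:+.0f} sd -> loudness slope = {slope:+.3f}')
```

The positive slope of loudness shrinks as energy rises, which is the "loudness helps less for energetic songs" pattern described above.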

Despite the added complexity, this model's performance did not improve over the simpler Model 4. The ROC AUC decreased slightly to around 0.64, and the F1 Score also fell to around 0.12. While precision slightly increased to 0.54, sensitivity dropped to around 0.065, meaning the model is now even worse at identifying popular songs.

Model 6. Interaction between categorical and continuous inputs (including main effects)¶

In [134]:
formula_list[5]
Out[134]:
'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)'
In [135]:
fit_glm_6 = smf.logit(formula=formula_list[5], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.588672
         Iterations 6
In [136]:
df_modeling_copy['pred_probability_M6'] = fit_glm_6.predict(df_modeling)
df_modeling_copy['pred_class_M6'] = np.where(df_modeling_copy.pred_probability_M6 > 0.5, 1, 0)
In [137]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M6, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
[Figure: confusion-matrix heatmap of binary_outcome vs pred_class_M6]
In [138]:
model_6_results = fit_and_analyze_logistic(mod_name='Model 6', a_formula=formula_list[5], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.588672
         Iterations 6
In [139]:
model_6_results
Out[139]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 36 {'Intercept': -1.4692198939587524, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.68162 0.159927 0.549958 0.936149 0.063851 0.247796 0.678517

Model 6, which includes interaction between categorical and continuous inputs (including main effects), has 198 coefficients, with 36 of them being statistically significant. Below is the full list of all statistically significant features and their values:

In [140]:
model_6_results['sign_coefs_&_their_values'].iloc[0]
Out[140]:
{'Intercept': -1.4692198939587524,
 'playlist_genre[T.latin]': 0.7129044076828355,
 'playlist_genre[T.pop]': 0.9697884006457378,
 'playlist_genre[T.r&b]': 0.5410788520732197,
 'playlist_genre[T.rap]': 0.683108840753402,
 'playlist_genre[T.rock]': 1.0648959049386686,
 'playlist_genre[T.pop]:danceability': 0.16260997561869103,
 'playlist_genre[T.rap]:danceability': 0.32166916627430203,
 'energy': -0.30098777260781245,
 'key[T.5]:energy': 0.28262384287694486,
 'key[T.10]:energy': 0.2675946672055558,
 'loudness': 0.22268063960026133,
 'playlist_genre[T.latin]:loudness': 0.308053273645286,
 'playlist_genre[T.pop]:loudness': 0.2308701655335041,
 'playlist_genre[T.r&b]:loudness': 0.31156778841244914,
 'key[T.5]:loudness': -0.2276773532572789,
 'playlist_genre[T.pop]:speechiness': 0.141346409361902,
 'key[T.4]:speechiness': -0.14989435832592057,
 'key[T.5]:speechiness': -0.15820907196729422,
 'key[T.7]:speechiness': -0.13415630717420718,
 'acousticness': 0.17087436844976686,
 'playlist_genre[T.pop]:acousticness': -0.1691862440953791,
 'playlist_genre[T.r&b]:acousticness': -0.15272166298122875,
 'playlist_genre[T.rock]:acousticness': -0.29354175255229853,
 'instrumentalness': -0.19774618445032627,
 'key[T.11]:instrumentalness': 0.16289813001903056,
 'playlist_genre[T.rock]:liveness': -0.17882547702557045,
 'mode[T.1]:liveness': 0.08418855173409909,
 'playlist_genre[T.r&b]:valence': -0.3002306480191845,
 'playlist_genre[T.rap]:valence': -0.21132225970349322,
 'duration_ms': -0.13982442258401978,
 'playlist_genre[T.latin]:duration_ms': 0.24701972934680702,
 'playlist_genre[T.rock]:duration_ms': 0.2239857676255843,
 'key[T.8]:duration_ms': -0.17156041337994027,
 'key[T.9]:duration_ms': -0.17927816063665708,
 'key[T.11]:duration_ms': -0.16265612640241334}
In [141]:
model_6_results.top_2_features[0]
Out[141]:
['Intercept', 'playlist_genre[T.rock]', 'playlist_genre[T.pop]']

The two statistically significant features with the highest magnitude are playlist_genre[T.rock] (around 1.06) and playlist_genre[T.pop] (around 0.97). We also see many significant interaction terms in this model. This suggests that the effect of an audio feature (such as danceability) on a track's popularity is not universal and can change depending on the song's genre and musical key.

This model shows better performance than Models 1-5. The ROC AUC improved to 0.68, and the F1 Score saw a substantial increase to 0.25, indicating a much better balance between precision and sensitivity. With sensitivity improving to around 0.16, this model is better at identifying popular songs. At the same time specificity decreased to 0.94, so the model makes more false positive errors to capture more true positives.

Model 7. Additional model #1: Interaction between playlist_genre and a targeted subset of continuous features¶

In [142]:
formula_list[6]
Out[142]:
'binary_outcome ~ playlist_genre * (danceability + energy + speechiness + instrumentalness + valence + tempo)'
In [143]:
fit_glm_7 = smf.logit(formula=formula_list[6], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.603147
         Iterations 6
In [144]:
df_modeling_copy['pred_probability_M7'] = fit_glm_7.predict(df_modeling)
df_modeling_copy['pred_class_M7'] = np.where(df_modeling_copy.pred_probability_M7 > 0.5, 1, 0)
In [145]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M7, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
[Figure: confusion-matrix heatmap of binary_outcome vs pred_class_M7]
In [146]:
model_7_results = fit_and_analyze_logistic(mod_name='Model 7', a_formula=formula_list[6], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.603147
         Iterations 6
In [147]:
model_7_results
Out[147]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 7 binary_outcome ~ playlist_genre * (danceabilit... 42 21 {'Intercept': -1.3781378385906253, 'playlist_g... [Intercept, playlist_genre[T.pop], playlist_ge... 0.671338 0.033172 0.483245 0.982693 0.017307 0.062082 0.642028

Model 7, which includes interaction between playlist_genre and six continuous input variables, has 42 coefficients, half of which are statistically significant. Below is the full list of all statistically significant features and their values:

In [148]:
model_7_results['sign_coefs_&_their_values'].iloc[0]
Out[148]:
{'Intercept': -1.3781378385906253,
 'playlist_genre[T.latin]': 0.7102180337256131,
 'playlist_genre[T.pop]': 0.9740358514667813,
 'playlist_genre[T.r&b]': 0.4004436020311392,
 'playlist_genre[T.rap]': 0.7063598154232721,
 'playlist_genre[T.rock]': 0.9671501258864602,
 'playlist_genre[T.pop]:danceability': 0.20939032310783862,
 'playlist_genre[T.rap]:danceability': 0.3628786991383162,
 'playlist_genre[T.rock]:danceability': 0.16626109811198622,
 'energy': -0.12705703967201093,
 'playlist_genre[T.pop]:speechiness': 0.14540488968728038,
 'instrumentalness': -0.29720189685181725,
 'playlist_genre[T.pop]:instrumentalness': -0.11078544361206202,
 'playlist_genre[T.rap]:instrumentalness': 0.2842712561135303,
 'playlist_genre[T.rock]:instrumentalness': 0.16163091120174888,
 'valence': 0.13730002794055163,
 'playlist_genre[T.latin]:valence': -0.1417282127383832,
 'playlist_genre[T.pop]:valence': -0.14974500479916417,
 'playlist_genre[T.r&b]:valence': -0.39258136414365397,
 'playlist_genre[T.rap]:valence': -0.2682237316365914,
 'playlist_genre[T.rock]:valence': -0.16352402014638404}
In [149]:
model_7_results.top_2_features[0]
Out[149]:
['Intercept', 'playlist_genre[T.pop]', 'playlist_genre[T.rock]']

The main effects for playlist_genre[T.pop] (around 0.97) and playlist_genre[T.rock] (around 0.97) remain the features with the highest magnitude. The significant interaction terms reveal more specific relationships; for example, the positive coefficient for playlist_genre[T.rap]:danceability suggests that danceability has a stronger positive effect on popularity for rap songs compared to the edm genre (reference group).
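The genre main effects can be turned into predicted probabilities with the logistic function. A small sketch using the Model 7 intercept and rap coefficient from the output above, for an illustrative "average" song (all standardized audio features set to 0, so those terms drop out; this is a simplification for illustration only):

```python
import numpy as np

# Coefficients taken from the Model 7 output above (rounded to four decimals)
intercept = -1.3781
b_rap = 0.7064

def sigmoid(z):
    """Inverse of the logit link: maps log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p_edm = sigmoid(intercept)           # baseline (reference) genre
p_rap = sigmoid(intercept + b_rap)   # rap main effect added to the intercept
print(f'P(popular | edm, average song) = {p_edm:.2f}')
print(f'P(popular | rap, average song) = {p_rap:.2f}')
```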

This simpler model represents a significant decrease in performance from the more comprehensive interaction model (Model 6). The ROC AUC dropped to 0.64, and the F1 Score fell sharply to 0.06. This is driven by the sensitivity returning to a low level of 0.03, indicating the model has poor ability to identify popular songs. This suggests that the features and interactions removed from this model were important for its predictive power.

Model 8. Additional model #2: cyclical (sine) tempo feature interacting with playlist_genre, plus main effects of other key audio features¶

In [150]:
formula_list[7]
Out[150]:
'binary_outcome ~ playlist_genre * np.sin(tempo) + danceability + energy + acousticness'
In [151]:
fit_glm_8 = smf.logit(formula=formula_list[7], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.611414
         Iterations 6
In [152]:
df_modeling_copy['pred_probability_M8'] = fit_glm_8.predict(df_modeling)
df_modeling_copy['pred_class_M8'] = np.where(df_modeling_copy.pred_probability_M8 > 0.5, 1, 0)
In [153]:
fig, ax = plt.subplots()

sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M8, margins=True),
            annot=True, annot_kws={'size': 20}, fmt='3d')

plt.show()
[Figure: confusion-matrix heatmap of binary_outcome vs pred_class_M8]
In [154]:
model_8_results = fit_and_analyze_logistic(mod_name='Model 8', a_formula=formula_list[7], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.611414
         Iterations 6
In [155]:
model_8_results
Out[155]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
0 Model 8 binary_outcome ~ playlist_genre * np.sin(tempo... 15 9 {'Intercept': -1.5459646472786424, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.671616 0.007143 0.453846 0.995806 0.004194 0.014064 0.618848

Model 8 was designed to test for a cyclical relationship between a song's tempo and its popularity. It has 15 coefficients, 9 of which are statistically significant. Below is the full list of all statistically significant features and their values:

In [156]:
model_8_results['sign_coefs_&_their_values'].iloc[0]
Out[156]:
{'Intercept': -1.5459646472786424,
 'playlist_genre[T.latin]': 0.9230082780788055,
 'playlist_genre[T.pop]': 1.153610253068739,
 'playlist_genre[T.r&b]': 0.653154735102794,
 'playlist_genre[T.rap]': 0.9299190469592341,
 'playlist_genre[T.rock]': 1.2374639439449069,
 'danceability': 0.09675689671802463,
 'energy': -0.06217633612842594,
 'acousticness': 0.13552851281491385}
In [157]:
model_8_results.top_2_features[0]
Out[157]:
['Intercept', 'playlist_genre[T.rock]', 'playlist_genre[T.pop]']

The key finding is that this model's central hypothesis was not supported by the data, as the cyclical tempo feature and its interactions were not statistically significant. Instead, the two most impactful predictors were the main effects for the genres, with rock (around 1.24) and pop (around 1.15) again showing the strongest positive association with popularity.

The performance of this model is worse than other interaction models. The ROC AUC of 0.62 is low, and the model almost completely fails to identify popular songs, as shown by the near-zero sensitivity (around 0.007) and F1 Score (0.014). The low performance of the model confirms that this exploratory approach was not successful.
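One plausible reason (an interpretation, not something the model output proves) is that np.sin applied to raw tempo in BPM completes a full cycle every 2*pi, roughly 6.3 BPM, so musically similar tempos map to wildly different feature values:

```python
import numpy as np

# sin() treats tempo (in BPM) as radians, so the feature oscillates with a
# period of ~6.3 BPM: near-identical tempos get very different values
for bpm in (118, 120, 122, 124):
    print(f'tempo = {bpm} BPM -> np.sin(tempo) = {np.sin(bpm):+.2f}')
```

A cyclical tempo hypothesis would more plausibly be tested by rescaling tempo before applying the sine, so that one cycle spans a musically meaningful range.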

Comparing the models' performance on the training set¶

In [158]:
results_list = []

for m in range(len(formula_list)):
    
    results_list.append( fit_and_analyze_logistic(m+1, formula_list[m], train_data = df_modeling, threshold = 0.5) )
Optimization terminated successfully.
         Current function value: 0.632687
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.615117
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.613043
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.598450
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.604714
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.588672
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.603147
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.611414
         Iterations 6
In [159]:
results_df = pd.concat(results_list, ignore_index=True)
In [160]:
results_df.sort_values(by=['Accuracy'], ascending=False)
Out[160]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
5 6 binary_outcome ~ (playlist_genre + key + mode)... 198 36 {'Intercept': -1.4692198939587524, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.681620 0.159927 0.549958 0.936149 0.063851 0.247796 0.678517
4 5 binary_outcome ~ (danceability + energy + loud... 56 27 {'Intercept': -0.7532051680487379, 'energy': -... [Intercept, energy, loudness] 0.675347 0.065375 0.541082 0.972947 0.027053 0.116656 0.637941
0 1 binary_outcome ~ 1 1 1 {'Intercept': -0.717662608612157} [Intercept] 0.672092 0.000000 0.000000 1.000000 0.000000 0.000000 0.500000
1 2 binary_outcome ~ playlist_genre + key + mode 18 7 {'Intercept': -1.633228568540369, 'playlist_ge... [Intercept, playlist_genre[T.pop], playlist_ge... 0.672092 0.000000 0.000000 1.000000 0.000000 0.000000 0.599069
7 8 binary_outcome ~ playlist_genre * np.sin(tempo... 15 9 {'Intercept': -1.5459646472786424, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.671616 0.007143 0.453846 0.995806 0.004194 0.014064 0.618848
6 7 binary_outcome ~ playlist_genre * (danceabilit... 42 21 {'Intercept': -1.3781378385906253, 'playlist_g... [Intercept, playlist_genre[T.pop], playlist_ge... 0.671338 0.033172 0.483245 0.982693 0.017307 0.062082 0.642028
2 3 binary_outcome ~ danceability + energy + loudn... 11 10 {'Intercept': -0.7548636863231273, 'energy': -... [Intercept, energy, loudness] 0.670822 0.027966 0.467611 0.984465 0.015535 0.052776 0.617639
3 4 binary_outcome ~ danceability + energy + loudn... 28 15 {'Intercept': -1.5917205139340989, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.669075 0.084988 0.474324 0.954046 0.045954 0.144148 0.656647

The most complex model - Model 6 - is also the most accurate. However, when evaluating model performance, the ROC AUC is the most reliable metric for this dataset due to the significant class imbalance (67% of songs are unpopular). Because accuracy can be misleading, I'm going to focus on the ROC AUC score.
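A quick sanity check on why accuracy is misleading here: a trivial predictor that labels every song "not popular" already matches the 0.672 accuracy of Models 1 and 2, because that is simply the share of the majority class.

```python
import numpy as np

n = 25190                    # training rows, from the describe() output above
n_neg = round(0.672092 * n)  # "not popular" count implied by the results table

# Label vector with the observed class balance, and an all-negative prediction
y_true = np.concatenate([np.zeros(n_neg), np.ones(n - n_neg)])
y_pred = np.zeros(n)

accuracy = (y_pred == y_true).mean()
print(f'All-negative accuracy: {accuracy:.4f}')
```

So Model 6's 0.6816 accuracy is only about one percentage point above this floor, even though its ROC AUC shows a genuine gain in ranking ability.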

In [161]:
results_df.sort_values(by=['ROC_AUC'], ascending=False)
Out[161]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
5 6 binary_outcome ~ (playlist_genre + key + mode)... 198 36 {'Intercept': -1.4692198939587524, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.681620 0.159927 0.549958 0.936149 0.063851 0.247796 0.678517
3 4 binary_outcome ~ danceability + energy + loudn... 28 15 {'Intercept': -1.5917205139340989, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.669075 0.084988 0.474324 0.954046 0.045954 0.144148 0.656647
6 7 binary_outcome ~ playlist_genre * (danceabilit... 42 21 {'Intercept': -1.3781378385906253, 'playlist_g... [Intercept, playlist_genre[T.pop], playlist_ge... 0.671338 0.033172 0.483245 0.982693 0.017307 0.062082 0.642028
4 5 binary_outcome ~ (danceability + energy + loud... 56 27 {'Intercept': -0.7532051680487379, 'energy': -... [Intercept, energy, loudness] 0.675347 0.065375 0.541082 0.972947 0.027053 0.116656 0.637941
7 8 binary_outcome ~ playlist_genre * np.sin(tempo... 15 9 {'Intercept': -1.5459646472786424, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.671616 0.007143 0.453846 0.995806 0.004194 0.014064 0.618848
2 3 binary_outcome ~ danceability + energy + loudn... 11 10 {'Intercept': -0.7548636863231273, 'energy': -... [Intercept, energy, loudness] 0.670822 0.027966 0.467611 0.984465 0.015535 0.052776 0.617639
1 2 binary_outcome ~ playlist_genre + key + mode 18 7 {'Intercept': -1.633228568540369, 'playlist_ge... [Intercept, playlist_genre[T.pop], playlist_ge... 0.672092 0.000000 0.000000 1.000000 0.000000 0.000000 0.599069
0 1 binary_outcome ~ 1 1 1 {'Intercept': -0.717662608612157} [Intercept] 0.672092 0.000000 0.000000 1.000000 0.000000 0.000000 0.500000
In [162]:
def fit_logistic_make_roc (mod_name, a_formula, train_data):
    a_mod = smf.logit(formula=a_formula, data=train_data).fit()

    train_copy = train_data.copy()

    train_copy['pred_probability'] = a_mod.predict(train_data)

    fpr, tpr, threshold = roc_curve(train_data.binary_outcome.to_numpy(), train_copy.pred_probability.to_numpy())

    res_df = pd.DataFrame({'tpr': tpr,
                           'fpr': fpr,
                           'threshold': threshold})

    res_df['model_name'] = mod_name
    res_df['model_formula'] = a_formula

    return res_df
In [163]:
roc_list = []

for m in range( len(formula_list) ):
    roc_list.append( fit_logistic_make_roc(m+1, formula_list[m], train_data=df_modeling) )
Optimization terminated successfully.
         Current function value: 0.632687
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.615117
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.613043
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.598450
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.604714
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.588672
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.603147
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.611414
         Iterations 6
In [164]:
roc_df = pd.concat(roc_list, ignore_index=True)
In [165]:
roc_df['model_name'] = roc_df.model_name.astype('category')
roc_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64115 entries, 0 to 64114
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   tpr            64115 non-null  float64 
 1   fpr            64115 non-null  float64 
 2   threshold      64115 non-null  float64 
 3   model_name     64115 non-null  category
 4   model_formula  64115 non-null  object  
dtypes: category(1), float64(3), object(1)
memory usage: 2.0+ MB
In [166]:
sns.relplot(data=roc_df, x='fpr', y='tpr', hue='model_name',
            kind='line', estimator=None, units='model_name',
            col='model_name', col_wrap=4)

plt.show()
[Figure: ROC curves (tpr vs fpr), one facet per model]

The ROC curve for the intercept-only Model 1 is a 45-degree diagonal line, which is the expected result for a baseline model with no discriminative ability. Because it assigns the same probability to every song, it cannot distinguish between the positive and negative classes, resulting in an AUC of 0.5.

In stark contrast, Model 6, the most complex model with 198 coefficients, has the most bowed curve: it is pushed furthest towards the top-left corner of the plot. This visually confirms it has the best performance on the training data, achieving the highest True Positive Rate at any given False Positive Rate, and thus the highest ROC AUC score. The other models' curves fall between these two extremes, generally improving as more features and complexity are added.

Which model has the best performance on the training set?¶

Based on the ROC AUC, Model 6 (the most complex model with the highest number of coefficients) has the best performance on the training set with a score of around 0.68. This indicates it has the strongest ability to distinguish between popular and unpopular songs out of all models tested.

Is the best model different when considering Accuracy vs ROC AUC?¶

No, in this case, the best model is the same for both metrics. Model 6 also has the highest accuracy. However, ROC AUC is the better measure of performance here, as the high accuracy score is inflated by the model's ability to correctly guess the majority "unpopular" class.

Is the best model better than the INTERCEPT-ONLY model?¶

Yes, the best model is significantly better. Model 6's ROC AUC of roughly 0.68 demonstrates a clear improvement in predictive power over the intercept-only model's score of 0.5, which is equivalent to random chance.

How many coefficients are associated with the BEST model?¶

The best-performing model, Model 6, is the most complicated one and has 198 coefficients.
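The 198-coefficient count follows directly from the formula's structure. A quick arithmetic sketch, assuming the dataset's 6 genres, 12 keys, and 2 modes:

```python
# Dummy coding drops one reference level per categorical variable
genre_dummies = 6 - 1   # playlist_genre
key_dummies = 12 - 1    # key
mode_dummies = 2 - 1    # mode
categorical_terms = genre_dummies + key_dummies + mode_dummies  # 17
continuous_terms = 10   # the ten audio features

# (categoricals) * (continuous) expands to main effects of both
# groups plus every pairwise categorical:continuous interaction
n_coefs = 1 + categorical_terms + continuous_terms + categorical_terms * continuous_terms
print(n_coefs)  # 198
```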

D. Models: Predictions¶

Creating the input grid¶

The model with all inputs and linear additive features is Model 4, and the model that performed best on the training set is Model 6. I will use both for predictions on the new data.

In [167]:
model_4_results.model_formula[0]
Out[167]:
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode'
In [168]:
model_6_results.model_formula[0]
Out[168]:
'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)'
In [169]:
model_4_results['sign_coefs_&_their_values'].iloc[0]
Out[169]:
{'Intercept': -1.5917205139340989,
 'playlist_genre[T.latin]': 0.8710920712060213,
 'playlist_genre[T.pop]': 1.0828670623658054,
 'playlist_genre[T.r&b]': 0.6345725938683296,
 'playlist_genre[T.rap]': 0.8857224431014801,
 'playlist_genre[T.rock]': 1.4025550042702633,
 'danceability': 0.11759980055645763,
 'energy': -0.2951494814511126,
 'loudness': 0.33056018693286277,
 'speechiness': -0.047355152773171635,
 'acousticness': 0.10518289624176624,
 'instrumentalness': -0.1609632391131655,
 'valence': -0.0459921965724886,
 'tempo': 0.08061844635555175,
 'duration_ms': -0.16591935884430412}
In [170]:
model_6_results['sign_coefs_&_their_values'].iloc[0]
Out[170]:
{'Intercept': -1.4692198939587524,
 'playlist_genre[T.latin]': 0.7129044076828355,
 'playlist_genre[T.pop]': 0.9697884006457378,
 'playlist_genre[T.r&b]': 0.5410788520732197,
 'playlist_genre[T.rap]': 0.683108840753402,
 'playlist_genre[T.rock]': 1.0648959049386686,
 'playlist_genre[T.pop]:danceability': 0.16260997561869103,
 'playlist_genre[T.rap]:danceability': 0.32166916627430203,
 'energy': -0.30098777260781245,
 'key[T.5]:energy': 0.28262384287694486,
 'key[T.10]:energy': 0.2675946672055558,
 'loudness': 0.22268063960026133,
 'playlist_genre[T.latin]:loudness': 0.308053273645286,
 'playlist_genre[T.pop]:loudness': 0.2308701655335041,
 'playlist_genre[T.r&b]:loudness': 0.31156778841244914,
 'key[T.5]:loudness': -0.2276773532572789,
 'playlist_genre[T.pop]:speechiness': 0.141346409361902,
 'key[T.4]:speechiness': -0.14989435832592057,
 'key[T.5]:speechiness': -0.15820907196729422,
 'key[T.7]:speechiness': -0.13415630717420718,
 'acousticness': 0.17087436844976686,
 'playlist_genre[T.pop]:acousticness': -0.1691862440953791,
 'playlist_genre[T.r&b]:acousticness': -0.15272166298122875,
 'playlist_genre[T.rock]:acousticness': -0.29354175255229853,
 'instrumentalness': -0.19774618445032627,
 'key[T.11]:instrumentalness': 0.16289813001903056,
 'playlist_genre[T.rock]:liveness': -0.17882547702557045,
 'mode[T.1]:liveness': 0.08418855173409909,
 'playlist_genre[T.r&b]:valence': -0.3002306480191845,
 'playlist_genre[T.rap]:valence': -0.21132225970349322,
 'duration_ms': -0.13982442258401978,
 'playlist_genre[T.latin]:duration_ms': 0.24701972934680702,
 'playlist_genre[T.rock]:duration_ms': 0.2239857676255843,
 'key[T.8]:duration_ms': -0.17156041337994027,
 'key[T.9]:duration_ms': -0.17927816063665708,
 'key[T.11]:duration_ms': -0.16265612640241334}

In both models, the three most important statistically significant inputs are playlist_genre, energy, and loudness. Comparing the main effects for energy and loudness across Models 4 and 6, the loudness coefficient in Model 4 has the single largest magnitude (around 0.33), so loudness was selected as the most impactful continuous feature of the two.
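Because the continuous inputs were standardized, these coefficients are on a per-standard-deviation, log-odds scale. A minimal sketch of converting one to an odds ratio, using the Model 4 loudness coefficient shown above:

```python
import numpy as np

loudness_coef = 0.33056  # Model 4 main effect: log-odds per 1 SD of loudness
odds_ratio = np.exp(loudness_coef)
print(round(odds_ratio, 2))  # roughly 1.39: each 1 SD increase in loudness
                             # multiplies the odds of popularity by ~1.39
```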

In [171]:
input_grid = pd.DataFrame([(danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, 
                            duration_ms, playlist_genre, key, mode) 
                            for danceability in [df_modeling.danceability.mean()]
                            for energy in np.linspace(df_modeling.energy.min(), df_modeling.energy.max(), num=5)
                            for loudness in np.linspace(df_modeling.loudness.min(), df_modeling.loudness.max(), num=101)
                            for speechiness in [df_modeling.speechiness.mean()]
                            for acousticness in [df_modeling.acousticness.mean()]
                            for instrumentalness in [df_modeling.instrumentalness.mean()]
                            for liveness in [df_modeling.liveness.mean()]
                            for valence in [df_modeling.valence.mean()]
                            for tempo in [df_modeling.tempo.mean()]
                            for duration_ms in [df_modeling.duration_ms.mean()]
                            for playlist_genre in df_modeling.playlist_genre.unique()
                            for key in df_modeling['key'].mode()
                            for mode in df_modeling['mode'].mode()],
                         columns=['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
                            'valence', 'tempo', 'duration_ms', 'playlist_genre', 'key', 'mode'])
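The nested-comprehension grid above can also be built with pandas cross joins; the grid size is the product of the unique value counts (5 energy × 101 loudness × 6 genres = 3030). A minimal sketch for just the varying inputs, with hypothetical endpoint values (the held-constant features would be added as single-value columns the same way):

```python
import numpy as np
import pandas as pd

# Varying inputs only; endpoints here are illustrative stand-ins for
# the training data's min/max values
energy_vals = pd.DataFrame({'energy': np.linspace(-3.75, 1.63, num=5)})
loudness_vals = pd.DataFrame({'loudness': np.linspace(-6.29, 10.72, num=101)})
genres = pd.DataFrame({'playlist_genre': ['pop', 'rap', 'rock', 'latin', 'r&b', 'edm']})

# how='cross' pairs every row of the left frame with every row of the right
grid = energy_vals.merge(loudness_vals, how='cross').merge(genres, how='cross')
print(len(grid))  # 5 * 101 * 6 = 3030
```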
In [172]:
input_grid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3030 entries, 0 to 3029
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      3030 non-null   float64
 1   energy            3030 non-null   float64
 2   loudness          3030 non-null   float64
 3   speechiness       3030 non-null   float64
 4   acousticness      3030 non-null   float64
 5   instrumentalness  3030 non-null   float64
 6   liveness          3030 non-null   float64
 7   valence           3030 non-null   float64
 8   tempo             3030 non-null   float64
 9   duration_ms       3030 non-null   float64
 10  playlist_genre    3030 non-null   object 
 11  key               3030 non-null   int64  
 12  mode              3030 non-null   int64  
dtypes: float64(10), int64(2), object(1)
memory usage: 307.9+ KB
In [173]:
input_grid.nunique()
Out[173]:
danceability          1
energy                5
loudness            101
speechiness           1
acousticness          1
instrumentalness      1
liveness              1
valence               1
tempo                 1
duration_ms           1
playlist_genre        6
key                   1
mode                  1
dtype: int64
In [174]:
input_grid.playlist_genre.value_counts()
Out[174]:
playlist_genre
pop      505
rap      505
rock     505
latin    505
r&b      505
edm      505
Name: count, dtype: int64

Making predictions on the new data¶

In [175]:
dfviz = input_grid.copy()
In [176]:
dfviz['pred_probability_M4'] = fit_glm_4.predict(input_grid)
In [177]:
dfviz['pred_probability_M6'] = fit_glm_6.predict(input_grid)
In [178]:
dfviz
Out[178]:
danceability energy loudness speechiness acousticness instrumentalness liveness valence tempo duration_ms playlist_genre key mode pred_probability_M4 pred_probability_M6
0 1.810911e-16 -3.754985 -6.293843 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 pop 1 1 0.190371 0.053369
1 1.810911e-16 -3.754985 -6.293843 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 rap 1 1 0.161821 0.202214
2 1.810911e-16 -3.754985 -6.293843 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 rock 1 1 0.244547 0.168505
3 1.810911e-16 -3.754985 -6.293843 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 latin 1 1 0.159846 0.041464
4 1.810911e-16 -3.754985 -6.293843 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 r&b 1 1 0.130574 0.035059
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3025 1.810911e-16 1.631404 10.720326 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 rap 1 1 0.916037 0.847017
3026 1.810911e-16 1.631404 10.720326 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 rock 1 1 0.948167 0.943913
3027 1.810911e-16 1.631404 10.720326 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 latin 1 1 0.914904 0.995504
3028 1.810911e-16 1.631404 10.720326 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 r&b 1 1 0.894592 0.994818
3029 1.810911e-16 1.631404 10.720326 4.377778e-16 -1.669874e-16 1.015464e-16 -1.545762e-16 1.083162e-16 3.430012e-16 -8.123712e-17 edm 1 1 0.818163 0.835694

3030 rows × 15 columns

Visualizing event probability¶

The line plot below shows the probability estimated by Model 4 on the new data (input_grid). It is colored by 5 distinct values of energy and each facet represents a playlist_genre.

In [179]:
sns.relplot(data=dfviz, x='loudness', y='pred_probability_M4', hue='energy', col='playlist_genre',
            kind='line', estimator=None, units='energy', palette='icefire', 
            col_wrap=3)

plt.show()
[Figure: Model 4 predicted probability vs loudness, colored by energy, faceted by playlist_genre]

For each playlist genre, the lines are upward-sloping, meaning that as loudness goes up, the probability of a track being popular increases. We can also see that as energy goes down, the probability of the track being popular increases. The shape of the curve and the spacing between the lines representing different energy levels are generally consistent across genres. This shows that Model 4 assumes the effect of each feature is the same regardless of the song's genre.

In [180]:
model_4_results['sign_coefs_&_their_values'].iloc[0]
Out[180]:
{'Intercept': -1.5917205139340989,
 'playlist_genre[T.latin]': 0.8710920712060213,
 'playlist_genre[T.pop]': 1.0828670623658054,
 'playlist_genre[T.r&b]': 0.6345725938683296,
 'playlist_genre[T.rap]': 0.8857224431014801,
 'playlist_genre[T.rock]': 1.4025550042702633,
 'danceability': 0.11759980055645763,
 'energy': -0.2951494814511126,
 'loudness': 0.33056018693286277,
 'speechiness': -0.047355152773171635,
 'acousticness': 0.10518289624176624,
 'instrumentalness': -0.1609632391131655,
 'valence': -0.0459921965724886,
 'tempo': 0.08061844635555175,
 'duration_ms': -0.16591935884430412}

The plot below shows the event probabilities predicted by Model 6 on the new data. Each line corresponds to a different energy level, and the facets separate the predictions across the six playlist genres.

In [181]:
sns.relplot(data=dfviz, x='loudness', y='pred_probability_M6', hue='energy', col='playlist_genre',
           kind='line', estimator=None, units='energy', palette='icefire',
           col_wrap=3)

plt.show()
[Figure: Model 6 predicted probability vs loudness, colored by energy, faceted by playlist_genre]

Similar to Model 4, the plots for Model 6 show that higher loudness and lower energy are generally associated with a higher probability of being popular. However, the key difference is that the shapes of these relationships now change for each genre.

The steepness of the lines varies noticeably - they are steepest for pop, indicating that loudness has the strongest positive impact on popularity in that genre, but are flattest for rap, meaning loudness has a much weaker effect for rap music.

We also observe larger differences in the spacing between the lines. For instance, the distance is largest for the rap genre and shortest for pop. This shows that the negative effect of energy on popularity is most pronounced for rap and least impactful for pop.

The overall pattern is that Model 6 learns how an audio feature's effect changes for each specific genre, whereas Model 4 forces the same effect across all genres.

In the plots for Model 6, we can see that for some genres, like pop, latin, and r&b, the predicted probabilities get much closer to 1 compared to other genres, like edm. This suggests that the model is much more certain in predictions for those three genres. Model 4, in contrast, doesn't show the same level of variation in certainty. The varied reliability of Model 6 allows us to trust its predictions more for some categories than for others.

In [182]:
model_6_results['sign_coefs_&_their_values'].iloc[0]
Out[182]:
{'Intercept': -1.4692198939587524,
 'playlist_genre[T.latin]': 0.7129044076828355,
 'playlist_genre[T.pop]': 0.9697884006457378,
 'playlist_genre[T.r&b]': 0.5410788520732197,
 'playlist_genre[T.rap]': 0.683108840753402,
 'playlist_genre[T.rock]': 1.0648959049386686,
 'playlist_genre[T.pop]:danceability': 0.16260997561869103,
 'playlist_genre[T.rap]:danceability': 0.32166916627430203,
 'energy': -0.30098777260781245,
 'key[T.5]:energy': 0.28262384287694486,
 'key[T.10]:energy': 0.2675946672055558,
 'loudness': 0.22268063960026133,
 'playlist_genre[T.latin]:loudness': 0.308053273645286,
 'playlist_genre[T.pop]:loudness': 0.2308701655335041,
 'playlist_genre[T.r&b]:loudness': 0.31156778841244914,
 'key[T.5]:loudness': -0.2276773532572789,
 'playlist_genre[T.pop]:speechiness': 0.141346409361902,
 'key[T.4]:speechiness': -0.14989435832592057,
 'key[T.5]:speechiness': -0.15820907196729422,
 'key[T.7]:speechiness': -0.13415630717420718,
 'acousticness': 0.17087436844976686,
 'playlist_genre[T.pop]:acousticness': -0.1691862440953791,
 'playlist_genre[T.r&b]:acousticness': -0.15272166298122875,
 'playlist_genre[T.rock]:acousticness': -0.29354175255229853,
 'instrumentalness': -0.19774618445032627,
 'key[T.11]:instrumentalness': 0.16289813001903056,
 'playlist_genre[T.rock]:liveness': -0.17882547702557045,
 'mode[T.1]:liveness': 0.08418855173409909,
 'playlist_genre[T.r&b]:valence': -0.3002306480191845,
 'playlist_genre[T.rap]:valence': -0.21132225970349322,
 'duration_ms': -0.13982442258401978,
 'playlist_genre[T.latin]:duration_ms': 0.24701972934680702,
 'playlist_genre[T.rock]:duration_ms': 0.2239857676255843,
 'key[T.8]:duration_ms': -0.17156041337994027,
 'key[T.9]:duration_ms': -0.17927816063665708,
 'key[T.11]:duration_ms': -0.16265612640241334}

E. Models: Performance and Validation¶

Select models¶

In [183]:
results_df.sort_values(by=['num_coefs'], ascending=False)
Out[183]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
5 6 binary_outcome ~ (playlist_genre + key + mode)... 198 36 {'Intercept': -1.4692198939587524, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.681620 0.159927 0.549958 0.936149 0.063851 0.247796 0.678517
4 5 binary_outcome ~ (danceability + energy + loud... 56 27 {'Intercept': -0.7532051680487379, 'energy': -... [Intercept, energy, loudness] 0.675347 0.065375 0.541082 0.972947 0.027053 0.116656 0.637941
6 7 binary_outcome ~ playlist_genre * (danceabilit... 42 21 {'Intercept': -1.3781378385906253, 'playlist_g... [Intercept, playlist_genre[T.pop], playlist_ge... 0.671338 0.033172 0.483245 0.982693 0.017307 0.062082 0.642028
3 4 binary_outcome ~ danceability + energy + loudn... 28 15 {'Intercept': -1.5917205139340989, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.669075 0.084988 0.474324 0.954046 0.045954 0.144148 0.656647
1 2 binary_outcome ~ playlist_genre + key + mode 18 7 {'Intercept': -1.633228568540369, 'playlist_ge... [Intercept, playlist_genre[T.pop], playlist_ge... 0.672092 0.000000 0.000000 1.000000 0.000000 0.000000 0.599069
7 8 binary_outcome ~ playlist_genre * np.sin(tempo... 15 9 {'Intercept': -1.5459646472786424, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.671616 0.007143 0.453846 0.995806 0.004194 0.014064 0.618848
2 3 binary_outcome ~ danceability + energy + loudn... 11 10 {'Intercept': -0.7548636863231273, 'energy': -... [Intercept, energy, loudness] 0.670822 0.027966 0.467611 0.984465 0.015535 0.052776 0.617639
0 1 binary_outcome ~ 1 1 1 {'Intercept': -0.717662608612157} [Intercept] 0.672092 0.000000 0.000000 1.000000 0.000000 0.000000 0.500000
In [184]:
formula_list[5]
Out[184]:
'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)'
In [185]:
formula_list[2]
Out[185]:
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms'
In [186]:
formula_list[3]
Out[186]:
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode'

I selected three models for cross-validation:

  • Model 6 - the model with the best performance on the training set. It includes interactions between the categorical and continuous inputs (along with their main effects).
  • Model 3 - a model with only the continuous inputs as linear additive features. It is a simple model with just 11 coefficients.
  • Model 4 - a model with all inputs (continuous and categorical) as linear additive features. It is a medium-complexity model with 28 coefficients.

Perform cross-validation and calculate performance metrics¶

In [187]:
from sklearn.model_selection import StratifiedKFold

I will use 5-fold cross-validation.

In [188]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
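StratifiedKFold preserves the class balance within each fold, which matters here given the roughly 67/33 split. A quick sketch with toy labels showing that every fold's test split keeps the outcome proportion:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels standing in for binary_outcome (67% class 0)
y = np.array([0] * 670 + [1] * 330)
X = np.zeros((len(y), 1))

kf_check = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
fold_props = []
for _, test_id in kf_check.split(X, y):
    fold_props.append(y[test_id].mean())
print(fold_props)  # ~0.33 in every fold
```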

I will fit the models with statsmodels.

In Section D at the preprocessing stage, I created a dataset df_numeric_inputs that includes only continuous input variables. I also performed log and cube root transformations on those variables whose distributions were strongly skewed. Thus, df_numeric_inputs is a dataset which already contains log/cube root transformed numeric input variables.

df_copy is a cleaned dataset containing all variables (including those that were not used in the analysis).

I will create the dataset df_CV, a list input_names, and an object output name that will be used in the arguments for the formula that will help streamline the process of cross-validation, standardization, and performance metrics calculation.

In [189]:
df_CV = df_numeric_inputs.copy()
In [190]:
df_CV['playlist_genre'] = df_copy.playlist_genre
df_CV['mode'] = df_copy['mode']
df_CV['key'] = df_copy.key
df_CV['binary_outcome'] = df_copy.binary_outcome.astype(int)
In [191]:
input_names = df_CV.drop(columns=['binary_outcome']).columns.tolist() 
input_names
Out[191]:
['danceability',
 'energy',
 'loudness',
 'speechiness',
 'acousticness',
 'instrumentalness',
 'liveness',
 'valence',
 'tempo',
 'duration_ms',
 'playlist_genre',
 'mode',
 'key']
In [192]:
output_name = 'binary_outcome'
In [193]:
df_CV.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      25190 non-null  float64
 1   energy            25190 non-null  float64
 2   loudness          25190 non-null  float64
 3   speechiness       25190 non-null  float64
 4   acousticness      25190 non-null  float64
 5   instrumentalness  25190 non-null  float64
 6   liveness          25190 non-null  float64
 7   valence           25190 non-null  float64
 8   tempo             25190 non-null  float64
 9   duration_ms       25190 non-null  int64  
 10  playlist_genre    25190 non-null  object 
 11  mode              25190 non-null  object 
 12  key               25190 non-null  object 
 13  binary_outcome    25190 non-null  int32  
dtypes: float64(9), int32(1), int64(1), object(3)
memory usage: 3.3+ MB

The function below automates the process of model training and evaluation by:

  1. Splitting the dataset into five folds,
  2. Standardizing continuous variables within each fold,
  3. Fitting a logistic regression model on the training set,
  4. Predicting outcomes for the test set, and
  5. Calculating performance metrics, including accuracy, sensitivity, precision, specificity, FPR, F1 score, and ROC AUC.
In [194]:
def train_test_and_asses_logistic_with_cv(mod_name, a_formula, data_df, x_names, y_name, cv, threshold=0.5):
    # Initialize a list for each performance metric to store the results from each fold
    accuracy_res = []
    sensitivity_res = []
    precision_res = []
    specificity_res = []
    fpr_res = []
    f1_res = []
    roc_auc_res = []

    # Get the names of the continuous variables that need standardizing
    continuous_vars = data_df[x_names].select_dtypes(include=np.number).columns.tolist()

    # Split the data and iterate over the folds
    for train_id, test_id in cv.split( data_df.to_numpy(), data_df[y_name].to_numpy() ):
        # subset the training and test splits within each fold
        train_data = data_df.iloc[train_id].copy()
        test_data = data_df.iloc[test_id].copy()

        # Standardize the data within each split
        scaler = StandardScaler()
        # Fit the scaler only on the training data to avoid data leakage
        scaler.fit(train_data[continuous_vars])
        # Use the fitted scaler to transform both the training and testing data
        train_data[continuous_vars] = scaler.transform(train_data[continuous_vars])
        test_data[continuous_vars] = scaler.transform(test_data[continuous_vars])

        # Fit the model on the training data within the current fold
        a_mod = smf.logit(formula=a_formula, data=train_data).fit()

        # Predict the test dataset within each fold
        test_copy = test_data.copy()
        test_copy['pred_probability'] = a_mod.predict(test_data)
        test_copy['pred_class'] = np.where(test_copy.pred_probability > threshold, 1, 0)

        # Calculate all performance metrics on the testing set
        TN, FP, FN, TP = confusion_matrix(test_copy[y_name].to_numpy(), test_copy.pred_class.to_numpy() ).ravel()

        Accuracy = (TN + TP) / (TN + FP + FN + TP) 
        Sensitivity = TP / (TP + FN) 
        Precision = TP / (TP + FP) if (TP + FP) > 0 else 0
        Specificity = TN / (TN + FP)
        FPR = 1 - Specificity
        F1_Score = 2 * (Precision * Sensitivity) / (Precision + Sensitivity) if (Precision + Sensitivity) > 0 else 0
        ROC_AUC = roc_auc_score(test_copy[y_name].to_numpy(), test_copy.pred_probability.to_numpy())

        # Append the results for this fold to our lists
        accuracy_res.append(Accuracy)
        sensitivity_res.append(Sensitivity)
        precision_res.append(Precision)
        specificity_res.append(Specificity)
        fpr_res.append(FPR)
        f1_res.append(F1_Score)
        roc_auc_res.append(ROC_AUC)

    # Bookkeeping to store the results
    results_dict = {'model_name': mod_name,
                    'model_formula': a_formula,
                    'num_coefs': len(a_mod.params),
                    'fold_id': list(range(1, len(accuracy_res) + 1)),
                    'Accuracy': accuracy_res,
                    'Sensitivity': sensitivity_res,
                    'Precision': precision_res,
                    'Specificity': specificity_res,
                    'FPR': fpr_res,
                    'F1 Score': f1_res,
                    'ROC AUC': roc_auc_res}
    
    # Convert the dictionary to a DataFrame
    test_df = pd.DataFrame(results_dict)
    
    return test_df
In [195]:
M3_CV = train_test_and_asses_logistic_with_cv('Model 3', formula_list[2], data_df=df_CV, x_names=input_names, y_name=output_name, cv=kf, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.612025
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.612084
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.613683
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.612960
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.614287
         Iterations 5
In [196]:
M4_CV = train_test_and_asses_logistic_with_cv('Model 4', formula_list[3], data_df=df_CV, x_names=input_names, y_name=output_name, cv=kf, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.597774
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.596139
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.599665
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.598338
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.599724
         Iterations 6
In [197]:
M6_CV = train_test_and_asses_logistic_with_cv('Model 6', formula_list[5], data_df=df_CV, x_names=input_names, y_name=output_name, cv=kf, threshold=0.5)
Optimization terminated successfully.
         Current function value: 0.586741
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.585734
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.589036
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.588041
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.589298
         Iterations 6
In [198]:
CV_results = pd.concat([M3_CV, M4_CV, M6_CV], ignore_index=True)
In [199]:
CV_results
Out[199]:
model_name model_formula num_coefs fold_id Accuracy Sensitivity Precision Specificity FPR F1 Score ROC AUC
0 Model 3 binary_outcome ~ danceability + energy + loudn... 11 1 0.672291 0.033898 0.504505 0.983757 0.016243 0.063528 0.606414
1 Model 3 binary_outcome ~ danceability + energy + loudn... 11 2 0.667130 0.026634 0.389381 0.979622 0.020378 0.049858 0.605217
2 Model 3 binary_outcome ~ danceability + energy + loudn... 11 3 0.673283 0.029056 0.533333 0.987596 0.012404 0.055109 0.621767
3 Model 3 binary_outcome ~ danceability + energy + loudn... 11 4 0.670107 0.024818 0.445652 0.984938 0.015062 0.047018 0.617876
4 Model 3 binary_outcome ~ danceability + energy + loudn... 11 5 0.672489 0.024213 0.512821 0.988777 0.011223 0.046243 0.634369
5 Model 4 binary_outcome ~ danceability + energy + loudn... 28 1 0.666733 0.084746 0.456026 0.950679 0.049321 0.142930 0.649198
6 Model 4 binary_outcome ~ danceability + energy + loudn... 28 2 0.659389 0.083535 0.405882 0.940343 0.059657 0.138554 0.638116
7 Model 4 binary_outcome ~ danceability + energy + loudn... 28 3 0.675069 0.082930 0.528958 0.963969 0.036031 0.143380 0.664914
8 Model 4 binary_outcome ~ danceability + energy + loudn... 28 4 0.670306 0.098668 0.486567 0.949203 0.050797 0.164066 0.655030
9 Model 4 binary_outcome ~ danceability + energy + loudn... 28 5 0.669710 0.084746 0.479452 0.955109 0.044891 0.144033 0.663753
10 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 1 0.666733 0.149516 0.474088 0.919079 0.080921 0.227335 0.652764
11 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 2 0.666534 0.141646 0.471774 0.922623 0.077377 0.217877 0.650286
12 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 3 0.677848 0.140436 0.533333 0.940047 0.059953 0.222329 0.671652
13 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 4 0.676856 0.168281 0.522556 0.924985 0.075015 0.254579 0.664146
14 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 5 0.674276 0.154358 0.511022 0.927939 0.072061 0.237099 0.673748
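These per-fold rows can be condensed by averaging across folds. A minimal sketch of the idea on a toy frame (the notebook's CV_results would be grouped the same way; the numbers here are illustrative):

```python
import pandas as pd

# Toy stand-in with two folds per model
toy = pd.DataFrame({
    'model_name': ['Model 3', 'Model 3', 'Model 6', 'Model 6'],
    'ROC AUC': [0.606, 0.622, 0.653, 0.671],
})

# Mean and spread of the metric across folds, per model
summary = toy.groupby('model_name')['ROC AUC'].agg(['mean', 'std']).round(3)
print(summary)
```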

Visualizing the cross-validation results¶

Accuracy¶

In [200]:
# `join=False` is deprecated in recent seaborn; `linestyle='none'` removes the line between points
sns.catplot(data=CV_results, x='model_name', y='Accuracy', kind='point', linestyle='none')

plt.show()
[Figure: point plot of mean Accuracy with 95% confidence intervals for Models 3, 4, and 6]
In [201]:
df_CV.binary_outcome.value_counts(normalize=True)
Out[201]:
binary_outcome
0    0.672092
1    0.327908
Name: proportion, dtype: float64

While Model 6 has the highest mean accuracy, its 95% CI largely overlaps with those of the other models, indicating no statistically significant difference. More importantly, because the dataset is imbalanced (over 67% of songs are unpopular), accuracy is a misleading metric. Therefore, I will focus on ROC AUC, which provides a more reliable measure of a model's ability to distinguish between classes, regardless of the classification threshold.
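To make the imbalance point concrete, here is a minimal sketch (using hypothetical labels drawn with roughly the same 67/33 split as df_CV) showing that a trivial classifier which always predicts "unpopular" already reaches about 67% accuracy, while its ROC AUC is exactly 0.5:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels mirroring the ~67/33 unpopular/popular split seen in df_CV
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.328, size=10_000)

# A trivial "model" that assigns every song the same score of zero
constant_scores = np.zeros_like(y_true, dtype=float)

acc = accuracy_score(y_true, (constant_scores >= 0.5).astype(int))
auc = roc_auc_score(y_true, constant_scores)
print(round(acc, 3), auc)  # accuracy ≈ 0.67, ROC AUC exactly 0.5
```

This is why a model can look acceptable on accuracy while having little real discriminative power.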

ROC AUC¶

In [202]:
# `join=False` is deprecated in recent seaborn; `linestyle='none'` removes the line between points
sns.catplot(data=CV_results, x='model_name', y='ROC AUC', kind='point', linestyle='none')

plt.show()
[Figure: point plot of mean ROC AUC with 95% confidence intervals for Models 3, 4, and 6]

The cross-validation results show that all three models have an average ROC AUC score significantly above 0.5, indicating they all possess predictive power better than random chance. While Model 6 has the highest mean ROC AUC, its 95% confidence interval has a substantial overlap with that of Model 4. This suggests there is no statistically significant difference in performance between these two models.
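The overlap can also be checked numerically. A small sketch, using the per-fold ROC AUC values from the CV_results table, computes each model's mean and a normal-approximation 95% interval (mean ± 1.96 × standard error):

```python
import pandas as pd

# Per-fold ROC AUC values copied from the CV_results table above
roc_auc = pd.DataFrame({
    'Model 3': [0.606414, 0.605217, 0.621767, 0.617876, 0.634369],
    'Model 4': [0.649198, 0.638116, 0.664914, 0.655030, 0.663753],
    'Model 6': [0.652764, 0.650286, 0.671652, 0.664146, 0.673748],
})

# Normal-approximation 95% interval: mean +/- 1.96 * standard error of the mean
summary = pd.DataFrame({'mean': roc_auc.mean(), 'half_width': 1.96 * roc_auc.sem()})
summary['lower'] = summary['mean'] - summary['half_width']
summary['upper'] = summary['mean'] + summary['half_width']
print(summary.round(4))
```

Model 4's upper bound lies above Model 6's lower bound, which is the overlap visible in the point plot.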

In [203]:
CV_results
Out[203]:
model_name model_formula num_coefs fold_id Accuracy Sensitivity Precision Specificity FPR F1 Score ROC AUC
0 Model 3 binary_outcome ~ danceability + energy + loudn... 11 1 0.672291 0.033898 0.504505 0.983757 0.016243 0.063528 0.606414
1 Model 3 binary_outcome ~ danceability + energy + loudn... 11 2 0.667130 0.026634 0.389381 0.979622 0.020378 0.049858 0.605217
2 Model 3 binary_outcome ~ danceability + energy + loudn... 11 3 0.673283 0.029056 0.533333 0.987596 0.012404 0.055109 0.621767
3 Model 3 binary_outcome ~ danceability + energy + loudn... 11 4 0.670107 0.024818 0.445652 0.984938 0.015062 0.047018 0.617876
4 Model 3 binary_outcome ~ danceability + energy + loudn... 11 5 0.672489 0.024213 0.512821 0.988777 0.011223 0.046243 0.634369
5 Model 4 binary_outcome ~ danceability + energy + loudn... 28 1 0.666733 0.084746 0.456026 0.950679 0.049321 0.142930 0.649198
6 Model 4 binary_outcome ~ danceability + energy + loudn... 28 2 0.659389 0.083535 0.405882 0.940343 0.059657 0.138554 0.638116
7 Model 4 binary_outcome ~ danceability + energy + loudn... 28 3 0.675069 0.082930 0.528958 0.963969 0.036031 0.143380 0.664914
8 Model 4 binary_outcome ~ danceability + energy + loudn... 28 4 0.670306 0.098668 0.486567 0.949203 0.050797 0.164066 0.655030
9 Model 4 binary_outcome ~ danceability + energy + loudn... 28 5 0.669710 0.084746 0.479452 0.955109 0.044891 0.144033 0.663753
10 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 1 0.666733 0.149516 0.474088 0.919079 0.080921 0.227335 0.652764
11 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 2 0.666534 0.141646 0.471774 0.922623 0.077377 0.217877 0.650286
12 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 3 0.677848 0.140436 0.533333 0.940047 0.059953 0.222329 0.671652
13 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 4 0.676856 0.168281 0.522556 0.924985 0.075015 0.254579 0.664146
14 Model 6 binary_outcome ~ (playlist_genre + key + mode)... 198 5 0.674276 0.154358 0.511022 0.927939 0.072061 0.237099 0.673748

Which model is the BEST according to CROSS-VALIDATION?¶

To determine the best model, I will focus on the ROC AUC score. Given that the dataset is imbalanced (over 67% of songs being unpopular), Accuracy is a misleading metric. The ROC AUC provides a more reliable assessment of a model's ability to discriminate between classes.

The point plot of the cross-validation results shows that Model 6 has the highest average ROC AUC. However, its 95% confidence interval substantially overlaps with that of Model 4. This overlap indicates that there is no statistically significant difference in performance between the two models.

Therefore, following the "simplest best" rule, Model 4 is the best model according to cross-validation. While its performance is similar to that of the much more complex Model 6, it achieves it with only 28 coefficients compared to Model 6's 198. This makes Model 4 less prone to overfitting and easier to interpret.
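The "simplest best" rule can be sketched as a small selection step. The mean ROC AUC values are the fold averages from CV_results; the tolerance below is a judgment call for illustration, not a value fixed earlier in the notebook:

```python
import pandas as pd

# Fold-averaged ROC AUC and coefficient counts from the cross-validation results above
candidates = pd.DataFrame({
    'model': ['Model 3', 'Model 4', 'Model 6'],
    'num_coefs': [11, 28, 198],
    'mean_roc_auc': [0.617129, 0.654202, 0.662519],
}).set_index('model')

# Keep every model within the tolerance of the top score, then take the simplest;
# the tolerance value itself is a hypothetical choice for this sketch
tolerance = 0.01
adequate = candidates[candidates['mean_roc_auc'] >= candidates['mean_roc_auc'].max() - tolerance]
best = adequate['num_coefs'].idxmin()
print(best)  # Model 4
```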

Is this model DIFFERENT from the model identified as the BEST according to the training set?¶

Yes. The best model according to cross-validation is Model 4, while the best model according to the training set is Model 6.

How many regression coefficients are associated with the best model?¶

Model 4 has 28 coefficients.

Differences between the best model's performance on the training set vs cross-validation¶

To see how the performance of Model 4 changed from the training set to cross-validation, I will compare its training-set metrics with the averages of all metrics across the 5 folds.

In [204]:
results_df[results_df.model_name == 4]
Out[204]:
model_name model_formula num_coefs num_sign_coefs sign_coefs_&_their_values top_2_features Accuracy Sensitivity Precision Specificity FPR F1 Score ROC_AUC
3 4 binary_outcome ~ danceability + energy + loudn... 28 15 {'Intercept': -1.5917205139340989, 'playlist_g... [Intercept, playlist_genre[T.rock], playlist_g... 0.669075 0.084988 0.474324 0.954046 0.045954 0.144148 0.656647
In [205]:
model_4_results = CV_results[CV_results['model_name'] == 'Model 4']
In [206]:
M4_average_performance_CV = model_4_results[['Accuracy', 'Sensitivity', 'Precision', 'Specificity', 'FPR', 'F1 Score', 'ROC AUC']].mean()
M4_average_performance_CV
Out[206]:
Accuracy       0.668241
Sensitivity    0.086925
Precision      0.471377
Specificity    0.951861
FPR            0.048139
F1 Score       0.146593
ROC AUC        0.654202
dtype: float64

The slight decrease in metrics like ROC AUC and Accuracy from the training set to cross-validation indicates a minor degree of overfitting. The small fluctuations in the other metrics are expected. The model performed slightly better on data it had already seen, and the cross-validation scores are a more reliable assessment of its predictive ability.
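A compact way to see the train-versus-CV gap side by side is to tabulate the metric values from the two outputs above:

```python
import pandas as pd

# Metric values copied from the two outputs above: Model 4 on the training set
# versus its 5-fold cross-validation averages
metrics = ['Accuracy', 'Sensitivity', 'Precision', 'Specificity', 'FPR', 'F1 Score', 'ROC AUC']
train = pd.Series([0.669075, 0.084988, 0.474324, 0.954046, 0.045954, 0.144148, 0.656647], index=metrics)
cv = pd.Series([0.668241, 0.086925, 0.471377, 0.951861, 0.048139, 0.146593, 0.654202], index=metrics)

comparison = pd.DataFrame({'training': train, 'cross_validation': cv, 'difference': train - cv})
print(comparison.round(4))
```

A positive difference means the model did better on the training set, as with ROC AUC and Accuracy here.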

Applying the best model on the entire dataset¶

To see the final set of coefficients for Model 4, which cross-validation identified as the best model, I re-fitted it on the entire preprocessed df_modeling dataset.

In [207]:
best_model = smf.logit(formula=formula_list[3], data=df_modeling).fit()
Optimization terminated successfully.
         Current function value: 0.598450
         Iterations 6
In [208]:
best_model.params
Out[208]:
Intercept                 -1.591721
playlist_genre[T.latin]    0.871092
playlist_genre[T.pop]      1.082867
playlist_genre[T.r&b]      0.634573
playlist_genre[T.rap]      0.885722
playlist_genre[T.rock]     1.402555
key[T.1]                   0.003998
key[T.2]                  -0.069688
key[T.3]                  -0.094121
key[T.4]                  -0.032777
key[T.5]                   0.072833
key[T.6]                   0.016002
key[T.7]                  -0.086394
key[T.8]                   0.094110
key[T.9]                  -0.018013
key[T.10]                  0.060207
key[T.11]                 -0.031106
mode[T.1]                  0.029470
danceability               0.117600
energy                    -0.295149
loudness                   0.330560
speechiness               -0.047355
acousticness               0.105183
instrumentalness          -0.160963
liveness                  -0.021375
valence                   -0.045992
tempo                      0.080618
duration_ms               -0.165919
dtype: float64

The statistically significant coefficients, ranked by the magnitude of their effect, are shown below. playlist_genre, loudness, and energy have the largest effects. However, the effect of playlist_genre is not uniform across genres: it is larger for rock and pop than for the other genres, relative to the edm baseline.

In [209]:
np.abs(best_model.params[best_model.pvalues < 0.05]).sort_values(ascending=False)
Out[209]:
Intercept                  1.591721
playlist_genre[T.rock]     1.402555
playlist_genre[T.pop]      1.082867
playlist_genre[T.rap]      0.885722
playlist_genre[T.latin]    0.871092
playlist_genre[T.r&b]      0.634573
loudness                   0.330560
energy                     0.295149
duration_ms                0.165919
instrumentalness           0.160963
danceability               0.117600
acousticness               0.105183
tempo                      0.080618
speechiness                0.047355
valence                    0.045992
dtype: float64
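Because this is a logistic regression, exponentiating a coefficient converts it to an odds ratio, which is often easier to interpret than the raw log-odds scale. A sketch using a few of the fitted coefficients above:

```python
import numpy as np
import pandas as pd

# A few coefficients copied from the fitted best_model above (edm is the baseline genre)
coefs = pd.Series({
    'playlist_genre[T.rock]': 1.402555,
    'playlist_genre[T.pop]': 1.082867,
    'loudness': 0.330560,
    'energy': -0.295149,
})

# exp(coefficient) is the multiplicative change in the odds of popularity for a
# one-unit increase in the predictor (or for that genre versus the edm baseline)
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

For example, holding the other predictors fixed, a rock playlist song has roughly four times the odds of being popular compared with the edm baseline, while higher energy lowers the odds (ratio below 1).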